|
Multiple Room Occupant Location and Identification
Purpose
|
To effectively interact with the users of the Intelligent Room it is
critical to know where the occupants are. With such information, HCI input
and output can be modified through understanding of the physical context
of each occupant. Examples of this are: using knowledge of an occupant's
location to resolve speech ambiguities such as "this" and "that";
displaying information on devices which are near and/or visible to the
occupant to whom the information is relevant. Occupant location information
can further be used as input for additional processing, such as providing
a foveal area for a gesture-recognition system, or allowing occupants to
be "objectified" so that they may be labeled or pointed to by
other systems.
|
Task Description
|
To provide a list of three-dimensional coordinates or bounding
boxes for each room occupant, given the current accumulated
knowledge about the room and input from the room's sensory
devices.
|
Overview of the Current Implementation
|
The current tracking system uses cameras located in two corners of the
room, as shown in figure 1.1;
their views are shown in figure 1.2.
|
Figure 1.1:
Locations of the two tracking cameras in the HCI room.
These cameras are mounted on the walls about 8 feet above the floor, and are
tilted slightly downward.
|
|
Figure 1.2:
Sample output images from the two tracking cameras.
|
|
The output image from each camera is analyzed by a program which labels each
occupant in the room and identifies a bounding box around them.
This information is then sent to a coordination program, which synchronizes the
findings from the individual cameras and combines their output using a neural
network to recover a 3D position for each room occupant.
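|
As a rough illustration of the data flowing through this pipeline, the following
sketch shows plausible record types for the per-camera and combined results; the
type and field names are hypothetical, not the actual visiondev interfaces.

    // Hypothetical sketch of the tracking pipeline's data; these names
    // do not correspond to the actual visiondev interfaces.
    struct TrackedOccupant2D {
        int   label;           // consistent identity assigned by a tracker
        float x0, y0, x1, y1;  // bounding box in one camera's image
    };

    struct Occupant3D {
        int   label;           // identity agreed upon by both trackers
        float x, y, z;         // 3D position recovered by the coordinator
    };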
|
Additional Implementation Features
|
In addition to producing the 3D position of room occupants, the tracking subsystem
controls several other aspects of the room's "behavior". This is done because these
additional behaviors are more effectively implemented as direct parts of this subsystem
than as independent systems, circumventing the need for interaction with the Brain
and yielding mechanisms which act much like reflex behaviors in living creatures.
Currently three such reflexes have been implemented:
- Control of mobile cameras to follow a particular room occupant.
In the current implementation these cameras are mounted above the
stationary tracking cameras, and are controlled by servo motors,
which allow the cameras to rotate laterally. These cameras are shown in
figure 1.3.
- Selection of optimal view of a particular occupant.
Using the 3D position of the occupant who is being followed by the
mobile cameras, this mechanism selects the "best" (as defined
by predetermined room locations) view of that occupant from
the two mobile-camera views and a third stationary camera centered in the room,
as shown in figure 1.4.
- Direct output of occupant location to a display program.
This output interface allows for real-time visualization of the location of
each of the room's occupants. To date, two output programs have been designed
to connect to this interface: one which shows a top-down view of current
and past locations, and another which uses a 3D Inventor display. These are shown
in figure 1.5.
|
Figure 1.3:
One of the dual-camera mounts in the HCI room.
The lower camera, which uses a wide-angle lens, is used for tracking room occupants.
The upper camera is mounted on a computer-controlled servo motor, driven by
the tracking subsystem to follow occupants around the room.
|
|
Figure 1.4:
Locations and approximate view frustums of the three cameras
whose views the selection mechanism chooses between.
|
|
Figure 1.5:
Sample output from programs connected to the visualization interface.
|
|
The Current Implementation
|
Code Files
|
Most of the code used for this implementation is based upon the Vision-Development (visiondev)
library, developed by members of the HCI project. For more information about this library
contact the author.
|
- Single camera tracking code
  - ~jsd/hci/TRACKING.cpp
    - contains code which performs tracking of multiple occupants using a single view
  - ~jsd/hci/HCI_INTERFACE.cpp
    - contains code which allows trackers to pass model information back and forth
    - contains the interface for communication with the coordination system
- Single camera tracking configuration
  - ~jsd/hci/TRACKER.cfg
    - various parameters which control tracker performance
  - ~jsd/hci/BoundingBoxesL.cfg
    - bounding boxes of legal occupant positions in the left camera view
  - ~jsd/hci/BoundingBoxesR.cfg
    - bounding boxes of legal occupant positions in the right camera view
- Mobile camera control program
  - ~jsd/hci/Follow.cpp
    - interfaces with the Motorola serial IO chip which controls the servo motors
      on the mobile cameras
- Camera coordination and brain interface
  - ~jsd/hci/Coordinator.cpp
    - combines output from the trackers to produce a 3D occupant position
    - interfaces with the Brain to output occupant positions, and handles requests
      for entering and exiting occupants
|
- Compilation:
- make TRACKER from the ~jsd/hci directory
|
Hardware requirements and configurations
|
Currently the tracking system requires the concurrent execution of six programs. A bare
minimum requirement is two SGI Indy stations, due to the single analog line-in limitation of
these systems. However, the system is currently configured to use three SGIs, to reduce the
workload on each machine. The current systems are:
- raphael.ai.mit.edu
- Coordinator
- brave-heart.ai.mit.edu
- TRACKER -L
(NOTE: The input from the left camera must be fed into
the serial line-in camera connection.)
- Follow -L
(NOTE: The serial-port must be connected to the left servo controller.)
- gameout
- diablo.ai.mit.edu
- TRACKER -R
(NOTE: The input from the right camera must be fed into
the serial line-in camera connection.)
- Follow -R
(NOTE: The serial-port must be connected to the right servo controller.)
|
Bootup sequence
|
- diablo and brave-heart must be logged into with xhost+ set.
- if connection to the Brain is desired, the Brain must be running before startup
- on raphael: run ~jsd/hci/Coordinator with options
- use [+/-]B to use or ignore the Brain
- use [+/-]A # to automatically grab the first # people in the virtual doorway
- use [+/-]T "<TRACKER OPTIONS>" to send options to the trackers
- use [+/-]F to follow the first occupant with the mobile cameras.
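- For example, assuming the Brain is already running, a plausible invocation is
  Coordinator +B +A 1 +F
  which connects to the Brain, automatically grabs the first person in the virtual
  doorway, and follows that occupant with the mobile cameras.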
|
Algorithms
The Tracker algorithm
|
The purpose of the tracker algorithm is, given an image from a single camera,
to identify the optimal bounding box for each occupant of the room.
|
Background subtraction for segmentation
|
The principal segmentation method used by this algorithm is background subtraction.
Because of the particular configuration and general nature of the HCI project,
we can take advantage of the fact that the HCI environment is generally static.
Further, we can exploit the fact that the cameras used for tracking are stationary.
Thus, by accumulating an image of what we expect the "true" background of the room to be,
we can subtract this background from the current image to find which pixels differ, and
hence contain non-background objects. An example of such a background image is shown
in figure 1.6.
|
Figure 1.6:
A sample background image used for segmentation.
|
|
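As a rough sketch, the per-pixel subtraction step might look like the following,
assuming single-channel 8-bit images stored as flat arrays (the actual
implementation operates on color images through the visiondev library):

    #include <cstdlib>  // for std::abs

    // Mark as foreground every pixel which differs from the accumulated
    // background by more than a fixed threshold.
    void subtractBackground(const unsigned char* image,
                            const unsigned char* background,
                            unsigned char* foregroundMask,
                            int numPixels, int threshold) {
        for (int i = 0; i < numPixels; ++i) {
            int diff = std::abs(int(image[i]) - int(background[i]));
            foregroundMask[i] = (diff > threshold) ? 255 : 0;
        }
    }
|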
There are several issues which arise within this method:
- Our principal objective is to identify room occupants, not chairs or other mobile items.
  Backgrounding should eliminate such objects from consideration.
- Stationary people, or people who are present during start-up, tend to present
  extra difficulties. A backgrounding method must be able to "learn" when to replace
  its previous belief about the background with new information.
- The system must set the segmentation threshold so that room occupants exceed it
  regardless of their local environment. Shadows make this problem particularly severe,
  because in shadow the variation between background and foreground (occupant) is far
  lower than elsewhere.
|
The current backgrounding scheme handles these problems using several techniques.
To handle the first two problems, allowing the background to update to incorporate
image regions which are not believed to be people, background updating is governed
by two mechanisms, one passive and one active.
|
The passive mechanism keeps track of
differences from the background and their longevity. Two backgrounds are
kept: the background which is being used for subtraction, and a second background
which averages over previous time-steps which are within threshold. When an input
image is acquired which differs from the accumulating background by more than the
threshold, the count for those pixels restarts. If the incoming image is within
threshold for more than a certain count of frames, the background which is used for
subtraction is updated with the new value. This has the effect of whole regions of
steady pixels being incorporated into the subtraction background. For example, when
a chair is moved, it will be detected in the subtraction image only for a certain
number of frames; then the whole chair will be incorporated into the background.
In figure 1.7, examples of the accumulation background and the counter image are shown.
|
Figure 1.7:
The background is updated using an accumulation buffer which keeps track of
steady regions which differ from the background. When a region has been steady
long enough, the background is updated.
|
|
The active mechanism simply suppresses the updating of the background accumulation
buffer within the bounding boxes surrounding room occupants. This suppression only
slows the "steadiness" counter down by a factor, instead of completely halting it; this
is done to prevent accidentally selected regions from being protected indefinitely.
For example, if a moved chair is accidentally identified as an occupant, suppressing
the counter completely would prevent the backgrounding mechanism from ever
incorporating the chair.
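|
A sketch of both updating mechanisms is shown below, again assuming single-channel
images; the averaging weight, absorption count, and suppression factor are
illustrative values, not the parameters in TRACKER.cfg:

    // Passive mechanism: per-pixel steadiness counting against the averaging
    // background 'accum'; pixels steady for long enough are absorbed into the
    // subtraction background. Active mechanism: inside occupant bounding boxes
    // the counter advances at half speed, slowed but never halted.
    void updateBackground(const unsigned char* image,
                          unsigned char* background,      // used for subtraction
                          float* accum,                   // averaging background
                          int* counter,                   // per-pixel steadiness count
                          const bool* insideOccupantBox,  // true within an occupant's box
                          int numPixels, int threshold, int framesToAbsorb) {
        for (int i = 0; i < numPixels; ++i) {
            float diff = float(image[i]) - accum[i];
            if (diff > threshold || diff < -threshold) {
                accum[i] = image[i];   // restart the steadiness count on the new value
                counter[i] = 0;
            } else {
                accum[i] = 0.9f * accum[i] + 0.1f * image[i];  // fold into the average
                counter[i] += insideOccupantBox[i] ? 1 : 2;    // suppression: half speed
                if (counter[i] >= 2 * framesToAbsorb) {
                    background[i] = (unsigned char)accum[i];   // absorb into background
                    counter[i] = 0;
                }
            }
        }
    }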
|
To compensate for shadows which cause dramatic changes in the brightness and
chromaticity of occupants as they move around the room, a color correction system is used.
|
Color Correction
|
Color correction for the tracking mechanism occurs in two phases.
|
The first uses pre-acquired knowledge of the lighting variation in the room to correct
the incoming image. This calibration has only been done roughly, yet it seems to do
a good job of correcting for many of the deepest shadows. Unfortunately, due to the
dynamic lighting of the room by two projection TVs, the usefulness of a static
color correction mechanism is limited. Calibration consists of collecting a time-averaged
image of a uniform white (with, ideally, spectrally uniform reflectance) object.
The incoming video stream can be color corrected by normalizing each pixel by its
corresponding pixel in the collected image. One such color correction image is shown
in figure 1.8.
|
Figure 1.8:
A calibration image used for color correction.
|
|
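The normalization step itself is a per-pixel division; a minimal sketch, assuming
floating-point images and a guard against near-zero calibration values (an
assumption not stated above):

    // Divide each pixel by its counterpart in the time-averaged calibration
    // image of the white object, compensating for static lighting variation.
    void colorCorrect(const float* image,        // incoming frame, one channel
                      const float* calibration,  // time-averaged white-object image
                      float* corrected, int numPixels) {
        for (int i = 0; i < numPixels; ++i) {
            // Avoid dividing by near-zero calibration values in very dark regions.
            float c = (calibration[i] > 0.01f) ? calibration[i] : 0.01f;
            corrected[i] = image[i] / c;
        }
    }
|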
The second method of color correction is done by transforming the color space of
the incoming image. Currently, (r, g, b) is mapped to
(r/(g+b), g/(r+b), b/(r+g)). Figure 1.9 shows a segmented
and color-transformed image. This is currently the weakest link in the
system; the stability of this normalization is less than optimal, and it is a
place where the current system could be greatly improved.
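|
The transform itself is straightforward; a sketch for a single RGB pixel, with a
small epsilon added to guard against division by zero (an assumption not stated
above):

    // Map (r, g, b) to (r/(g+b), g/(r+b), b/(r+g)) to reduce sensitivity
    // to brightness changes such as those caused by shadows.
    void normalizeColor(float r, float g, float b,
                        float& rn, float& gn, float& bn) {
        const float eps = 1e-6f;  // guard against division by zero
        rn = r / (g + b + eps);
        gn = g / (r + b + eps);
        bn = b / (r + g + eps);
    }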
|
Figure 1.9:
Segmented output which has been transformed into a more stable
though still suboptimal color space.
|
|
Occupant detection
|
Given the segmented output from the earlier processes, the remaining task
is to determine bounding boxes around each room occupant, and consistently
label them.
|
First, regions which are most likely to contain room occupants are detected;
further attention can then be paid to these selected regions to perform consistent labeling.
Candidate regions are selected from a pre-calibrated list of legal bounding boxes
(currently stored in the file HCIcfg/BoundingBoxes*.cfg, where '*' is one of 'L' or 'R').
By predetermining legal bounding boxes, we can compensate for occlusions by
assuming the existence of a full-sized person, even though part of their
body may be hidden from view. Further, this has the effect of reducing the space
which we must consider as candidates for occupancy.
For example, we do not need to consider people on the ceiling.
An example of the bounding boxes is shown in figure 1.10.
The N+2 bounding boxes with the highest density of pixels which differ
from the background are then selected. The selected regions are then passed on to
the labeling system.
|
Figure 1.10:
Example of the legal bounding boxes for a room occupant, shown in gray.
Black boxes are partially filled, and the green box is the best model fit.
|
|
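A sketch of the candidate selection step, scoring each legal box by the fraction
of foreground pixels it contains and keeping the N+2 best; the Box type and
flat-mask representation are assumptions:

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct Box { int x0, y0, x1, y1; };  // a legal occupant bounding box

    // Fraction of pixels inside 'box' that the segmentation marked as foreground.
    float foregroundDensity(const unsigned char* mask, int width, const Box& box) {
        int count = 0, area = 0;
        for (int y = box.y0; y < box.y1; ++y)
            for (int x = box.x0; x < box.x1; ++x, ++area)
                if (mask[y * width + x]) ++count;
        return area > 0 ? float(count) / float(area) : 0.0f;
    }

    // Return the n+2 legal boxes with the highest foreground density.
    std::vector<Box> selectCandidates(const unsigned char* mask, int width,
                                      const std::vector<Box>& legalBoxes, int n) {
        std::vector<std::pair<float, Box>> scored;
        for (const Box& b : legalBoxes)
            scored.push_back({foregroundDensity(mask, width, b), b});
        std::sort(scored.begin(), scored.end(),
                  [](const auto& a, const auto& b) { return a.first > b.first; });
        std::vector<Box> candidates;
        for (int i = 0; i < n + 2 && i < (int)scored.size(); ++i)
            candidates.push_back(scored[i].second);
        return candidates;
    }
|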
To identify each occupant the tracker algorithm makes use of a time averaged
model of each occupant. This model is initialized when a new occupant announces himself to
the room. The Brain then sends a message to the coordinator, which simply passes the
message on to the trackers to look for a new occupant in the pre-selected virtual
doorway (see the HCIcfg/TRACKER.cfg file for the current settings). When both
trackers signal that they have located the new occupant, they begin tracking.
|
Every several frames the two trackers pass their models back and forth so
that over time they have the same model of the occupant. This is useful because
presumably all sides of the occupant will be observed by both cameras.
Oversized samples (for viewing purposes) are shown in figure 1.11.
|
|
Labeling is performed by exhaustively calculating the error of matching the
preselected boxes against each model,
a process which is O(N!) in the number of room occupants;
however, for small N this is acceptable.
The labeling which minimizes the total error is then selected.
This labeling, acquired independently for each of the two tracking cameras, is then
passed on to the Coordination system.
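|
A sketch of this exhaustive assignment using std::next_permutation, simplified to
the case of exactly N candidate boxes and N models; the error matrix is a
stand-in for the actual model-matching computation:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // error[box][model] holds the model-matching error between candidate box
    // 'box' and occupant model 'model' (a stand-in for the real computation).
    // Try every assignment of boxes to models -- O(N!) in the number of
    // occupants -- and return the permutation with minimum total error.
    std::vector<int> bestLabeling(const std::vector<std::vector<float>>& error) {
        int n = (int)error.size();
        std::vector<int> perm(n), best;
        std::iota(perm.begin(), perm.end(), 0);  // start from the identity assignment
        float bestError = 1e30f;
        do {
            float total = 0.0f;
            for (int model = 0; model < n; ++model)
                total += error[perm[model]][model];
            if (total < bestError) { bestError = total; best = perm; }
        } while (std::next_permutation(perm.begin(), perm.end()));
        return best;  // best[model] = index of the box assigned to that model
    }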
|
The Coordination Algorithm
|
In addition to handling communication with the Brain, the Coordinator program
combines the output from the two independent tracking systems to provide a single
3D location for each occupant.
|
Several techniques have been tried to perform this 3D reconstruction:
- Nearest-Neighbor
  - Using the known positions of the occupant when the bounding boxes were collected,
    the simplest solution was to simply return a 3D position corresponding to the
    average of the known locations associated with the bounding boxes from each image.
- Linear approximation
  - The closest four neighbors are averaged to produce a 3D position,
    in the same fashion as above.
- Neural Network
  - The bounding box data was fed into a neural network, the result of which
    is a function which closely approximates the projective transformation.
- Reversing the projective transformation
  - Because the Neural Network solution works well, this has not been implemented;
    however, the best solution is to find the projective transformation
    which best fits the bounding boxes and then invert it to recover the 3D position.
|
Currently the neural network is used to find the 3D position. The projective
transformation is clearly a more robust and well-founded technique, and should at
some point be applied.
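|
For reference, a sketch of the simplest of these techniques, nearest-neighbor
lookup, assuming a table of calibration samples pairing observed bounding-box
centers with known 3D positions (the representation is hypothetical):

    #include <vector>

    struct Sample {
        float leftX, leftY, rightX, rightY;  // bounding-box centers in each camera
        float x, y, z;                       // known 3D position when collected
    };

    // Return the calibration sample whose observed bounding-box centers are
    // closest to the current observation. Assumes a non-empty table.
    Sample nearestNeighbor(const std::vector<Sample>& table,
                           float lx, float ly, float rx, float ry) {
        const Sample* best = &table.front();
        float bestDist = 1e30f;
        for (const Sample& s : table) {
            float d = (s.leftX - lx) * (s.leftX - lx) + (s.leftY - ly) * (s.leftY - ly)
                    + (s.rightX - rx) * (s.rightX - rx) + (s.rightY - ry) * (s.rightY - ry);
            if (d < bestDist) { bestDist = d; best = &s; }
        }
        return *best;
    }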
|
Occupant Following and View Selection
|
Each tracking camera independently controls its mobile camera simply by setting its
rotation to be the X coordinate of the desired occupant in the tracking image.
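|
A sketch of such a mapping, assuming the servo accepts a command proportional to
the image X coordinate; the command range is a hypothetical calibration value,
and the actual serial protocol lives in Follow.cpp:

    // Map the occupant's X coordinate in the tracking image to a servo
    // rotation command. The command range is an assumed calibration value.
    int servoCommand(float occupantX, float imageWidth) {
        const int servoMin = 0, servoMax = 255;  // assumed command range
        float t = occupantX / imageWidth;        // 0.0 (left) .. 1.0 (right)
        return servoMin + int(t * (servoMax - servoMin));
    }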
|
The optimal view is selected based upon the location of the occupant as determined
by the Coordination system. View selection is MUX-controlled from within the coordinator
program. Figure 1.12 shows the regions of the room which, when
entered by an occupant, cause a switch to the corresponding camera.
The camera positions are shown in figure 1.4.
Notice that when the occupant is on the right of the room, the left follow camera is
selected, because the occupant is
presumably facing into the room, toward the left camera. When the occupant is in the
"dead zones" in between, the selected view retains its last output to create a steady
flow and prevent rapid flipping back and forth when the occupant is on a region border.
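|
A sketch of this selection logic with dead-zone hysteresis; the region boundaries
are illustrative placeholders for those shown in figure 1.12:

    enum Camera { LEFT_FOLLOW, CENTER, RIGHT_FOLLOW };

    // Select a camera from the occupant's room X position, retaining the
    // previous selection inside the dead zones between regions.
    // Region boundaries here are illustrative, not the calibrated values.
    Camera selectView(float x, Camera previous) {
        if (x > 2.0f)  return LEFT_FOLLOW;   // occupant on the right: use left camera
        if (x < -2.0f) return RIGHT_FOLLOW;  // occupant on the left: use right camera
        if (x > -1.0f && x < 1.0f) return CENTER;
        return previous;                     // dead zone: keep the last selection
    }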
|
Figure 1.12:
The regions in the room which trigger selection of a particular camera output for
the optimal view of a particular occupant. Useful for applications such as automatic
occupant filming (during a lecture for example).
|
|
Page last modified on 2006-05-27.
|
Copyright © 1997-2024, Jeremy S. De Bonet.
All rights reserved.
|
|