Object Tracking Extension

Object detection can sometimes be sporadic, leading to gaps in detection across frames. This complicates object tracking. Of specific interest is multi-object tracking (MOT) in online mode, ie, only the current and prior frames are available, with infinite additional frames expected. Batch MOT works on a finite video files.

Reviewed the non-deep-learning MOT methods listed as related work in the FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking paper, specifically the IOU Tracker and found a further refinement to it, the V-IOU Tracker and code. Non-deep-learning MOT was chosen to help keep frame processing rates up, a decision that may be revisited later if necessary.

From the V-IOU Tracker paper, regarding the intersection over union (IOU) tracker:

A major drawback of this simple approach is its requirement for a high recall of the underlying detector. Every gap caused by a single or few missing detections leads not only to false negatives but also to the termination and restart of the track, causing high rates of fragmentation and ID switches.

The V-IOU tracker continues the tracking of objects utilizing a visual tracker (such as those found in OpenCV) for a specified number of additional frames when no new detections can be associated, and thus can fill the gaps between detections.

The following result was achieved, using:

  • For persons

    • The initial center-velocity-projection scheme to track persons was used. (This box-center scheme will be improved by taking pose information into account.) A problem with using loose IOU conditions with persons at the range of interest (2-15 meters) is that people are basically the same size and orientation (compared to weapons), so if one person is mostly behind the other, with enough sticking out for detection, IOU will often deem them the same person.
  • For non-persons

    • The basic V-IOU approach concept was used
    • Plus, a scheme utilizing lower-score detections after a high-score detection
  • A plus sign (+) before the object detection score indicates extension of tracking via visual tracker
  • Green plus signs are box-centers of detected persons
  • Blue plus signs are velocity-projected person box centers
  • Black diamonds are person keypoints

Results are promising; more improvements need to be made.



Portrait-mode input video was captured then cut in iMovie and exported, producing a landscape viewport (16:9) with black boxes on the sides of the video. The black boxes interfered with object detection and produced weak detection results. The landscape video was cropped to its original viewport (via Photos) and object detection accuracy returned to best levels.