Object Detector Training

The Deep High-Resolution Representation Learning for Human Pose Estimation (CVPR 2019) pose estimator is top-down, i.e., it takes the bounding boxes of persons as part of its input. It derives those person bounding boxes from PyTorch's pre-trained Faster R-CNN ResNet-50 FPN model. Since we also want to detect other objects of interest, namely weapons, we will fine-tune that same pre-trained model with person, handgun, rifle, and knife data. This performs faster than chaining a person detector with a separate weapons detector, since a single model provides both person and weapon detections.

Other pose estimators are bottom-up: they first find limb components in the scene, then assemble persons from them. They also perform well, and their use is accommodated by HelVision's software architecture.

Person data for training comes from the Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) Visual Object Classes (VOC) project.

Initial handgun and knife data comes from the Soft Computing and Intelligent Information Systems University of Granada research group.

Rifle data was collected and labelled in-house.

After 25 epochs of training, the initial object detector scores are as follows:

mean AP   person AP   handgun AP   rifle AP   knife AP
0.7512    0.7637      0.8628       0.6244     0.7538

AP = average precision
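As a quick sanity check, the mean AP is the unweighted average of the four per-class APs:

```python
# Per-class AP values from the table above
ap = {"person": 0.7637, "handgun": 0.8628, "rifle": 0.6244, "knife": 0.7538}

mean_ap = sum(ap.values()) / len(ap)
print(round(mean_ap, 4))  # 0.7512
```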

This performance is insufficient. The following improvements will be made before another round of training:

  • The Granada weapons data does not label persons in the scene, contains many near-duplicate images, and its bounding boxes often leave no margin around the object of interest, which is vital for edge detection. A manual pass over the data will be needed to add missing labels, adjust bounding boxes, and remove duplicates.

  • Collect and label more rifle data.
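For the bounding-box adjustment, a small helper along these lines could add a fractional margin around each box, clamped to the image bounds. The function name and margin fraction are illustrative, not part of the pipeline described above.

```python
def expand_box(box, margin_frac, img_w, img_h):
    """Expand an (xmin, ymin, xmax, ymax) box by a fraction of its
    width/height on each side, clamped to the image bounds."""
    xmin, ymin, xmax, ymax = box
    dx = (xmax - xmin) * margin_frac
    dy = (ymax - ymin) * margin_frac
    return (max(0.0, xmin - dx), max(0.0, ymin - dy),
            min(float(img_w), xmax + dx), min(float(img_h), ymax + dy))

# A 10x10 box grows by 10% of its size on every side
print(expand_box((10, 10, 20, 20), 0.1, 100, 100))  # (9.0, 9.0, 21.0, 21.0)
```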