Body Motion Classification

Given a 3D skeletal model of the subject (as provided by systems such as VideoPose3D), some body language can be inferred simply by inspecting joint positions over time. Examples include:

  • Subject looking at the viewer
  • Subject staring at the viewer
  • Subject adopting a bladed pose
  • Subject holding hands at the ready
  • Subject approaching
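As a rough illustration of what "simple inspection" could look like, the sketch below checks two of the cases above: whether the subject is approaching and whether the shoulders are turned into a bladed pose. It is not the implementation used here. The joint indices, the camera-space coordinate convention (metres, z as depth, camera at the origin), and all thresholds are assumptions made for the example. Some pose estimators output root-relative joints, in which case the approach check would also need a global trajectory estimate.

```python
import math

# Assumed joint indices (Human3.6M-style 17-joint layout); verify against your pose estimator.
PELVIS, L_SHOULDER, R_SHOULDER = 0, 11, 14
CAMERA = (0.0, 0.0, 0.0)  # assumed: joints are in camera space, in metres, z = depth


def is_approaching(frames, fps=30.0, min_speed=0.3):
    """True if the pelvis closes on the camera at min_speed m/s or faster on average.

    frames is a list of skeletons; each skeleton is a list of (x, y, z) joints.
    min_speed is a hypothetical threshold, not a tuned value.
    """
    if len(frames) < 2:
        return False
    start = math.dist(frames[0][PELVIS], CAMERA)
    end = math.dist(frames[-1][PELVIS], CAMERA)
    seconds = (len(frames) - 1) / fps
    return (start - end) / seconds >= min_speed


def shoulder_yaw_degrees(frame):
    """Angle of the shoulder line away from the image plane, measured in the x-z plane.

    0 means squared up to the camera, 90 means fully side-on.
    Assumes the subject stands roughly in front of the camera.
    """
    lx, _, lz = frame[L_SHOULDER]
    rx, _, rz = frame[R_SHOULDER]
    return math.degrees(math.atan2(abs(lz - rz), abs(lx - rx)))


def is_bladed(frame, low=30.0, high=70.0):
    """True if the shoulder yaw falls inside a hypothetical 'bladed' range."""
    return low <= shoulder_yaw_degrees(frame) <= high
```

In practice such checks would be applied over a sliding window of frames and smoothed, since per-frame joint estimates are noisy.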

Other body language of interest, such as drawing a weapon, spitting, or making threatening gestures, has a signature that a human viewer identifies easily but is less well suited to classification by algorithmic inspection of skeletal models.

To explore these more complicated cases, a spatiotemporally aware graph convolutional network was trained to distinguish harmless conversational gesticulation from the drawing of a handgun from concealment, given 3D skeletal representations of the subject. This type of network was chosen for two reasons:

  • The body motion of interest typically occurs over a couple of seconds, so the network needs to be both spatially and temporally aware
  • The structure of a human skeleton, with joints at 3D locations connected by bones, lends itself well to graph convolutional networks
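To make the two points above concrete, the sketch below treats a skeleton as a graph (joints as nodes, bones as edges) and applies one spatial mixing step across neighbouring joints followed by one temporal mixing step across neighbouring frames. This is a generic, weight-free illustration of graph convolution over a clip, not the network that was trained. The 17-joint bone list is an assumed layout and the function names are hypothetical.

```python
# Assumed 17-joint skeleton as (child, parent) bone pairs; real layouts vary by pose estimator.
BONES = [(1, 0), (2, 1), (3, 2), (4, 0), (5, 4), (6, 5), (7, 0), (8, 7),
         (9, 8), (10, 9), (11, 8), (12, 11), (13, 12), (14, 8), (15, 14), (16, 15)]
NUM_JOINTS = 17


def normalized_adjacency(bones, num_joints):
    """Row-normalised adjacency matrix with self-loops."""
    adj = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1.0
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def spatial_mix(clip, adj):
    """Replace each joint's features with the average over itself and its bone neighbours.

    clip[t][j] is the feature list (for example x, y, z) of joint j in frame t.
    """
    out = []
    for frame in clip:
        joints, channels = len(frame), len(frame[0])
        out.append([[sum(adj[i][j] * frame[j][c] for j in range(joints))
                     for c in range(channels)] for i in range(joints)])
    return out


def temporal_mix(clip, radius=1):
    """Average each joint's features over nearby frames (stand-in for a temporal convolution)."""
    out = []
    for t in range(len(clip)):
        window = clip[max(0, t - radius):min(len(clip), t + radius + 1)]
        out.append([[sum(f[j][c] for f in window) / len(window)
                     for c in range(len(clip[t][j]))] for j in range(len(clip[t]))])
    return out
```

A trained network would interleave many such spatial and temporal steps, each with learned weights and nonlinearities, before a final classification layer.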

The paper "Quo Vadis, Skeleton Action Recognition?", which evaluates several systems, led to the choice of the network described in "Skeleton-Based Action Recognition with Shift Graph Convolutional Network", or simply Shift-GCN, for its performance and small size.

The dataset consisted of 160 video clips of a person explaining something with some gesticulation and 160 video clips of a person drawing a handgun from concealment (both inside and outside the waistband, under clothing). 80% of the clips were used for training and 20% for testing. After 100 epochs, the network reached an accuracy of 95.3%.
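For a sense of scale, the sketch below works through the split arithmetic. It assumes the 80/20 split was applied per class (a stratified split), which is not stated above, and uses placeholder clip names. If the reported accuracy was measured on the held-out clips, 95.3% corresponds to 61 of 64 test clips classified correctly.

```python
import random


def stratified_split(clips_by_label, train_fraction=0.8, seed=0):
    """Shuffle each class separately, then cut it into train and test portions."""
    rng = random.Random(seed)
    train, test = [], []
    for label, clips in clips_by_label.items():
        shuffled = clips[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(clip, label) for clip in shuffled[:cut]]
        test += [(clip, label) for clip in shuffled[cut:]]
    return train, test


# Placeholder clip names; the real dataset files are not described in this write-up.
clips_by_label = {
    "gesticulation": [f"talk_{i:03d}" for i in range(160)],
    "handgun_draw": [f"draw_{i:03d}" for i in range(160)],
}

train, test = stratified_split(clips_by_label)
print(len(train), len(test))      # 256 64
print(f"{61 / len(test):.1%}")    # 95.3%
```

With only 64 test clips, each misclassified clip shifts the accuracy by about 1.6 percentage points, so the figure is best read as a coarse estimate.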

The following video shows an example of the handgun-draw body motion classification.

Discussion

Once the going-for-a-concealed-item action has been detected, several responses are possible: drawing or aiming the viewer’s own weapon, issuing verbal commands or warnings, or taking evasive action. The time frame in the example above may be too short for verbal commands if the subject has already decided to shoot.

Although quickly drawing a handgun to “get the drop” on someone or to get a shot off is quite different from simply retrieving a cell phone, visual confirmation of the weapon would still be needed before a shooting response.

An automated system armed with a taser or other non-lethal round would be well suited to handle the verbal command -> visual confirmation -> shooting response escalation path, given the time frame involved.