2D-Pose-Estimation to 3D-Person Modeling

To better analyze a subject's body language when looking for pre-assaultive cues, translating 2D body keypoints from video input into 3D body keypoints in an internal 3-dimensional space was explored.

The approach taken was to create a 3D person model, with typical bone lengths and joint ranges-of-motion, then perturb, or search, the model until its projection onto a 2D plane matches the 2D keypoints produced by a pose estimator. The resulting 3D model should then match what was filmed.
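To make the fit criterion concrete, the following minimal sketch shows the kind of reprojection error such a search would minimize. It assumes a simple pinhole camera and confidence-weighted keypoints; the function names and weighting are illustrative, not the project's actual code.

    import numpy as np

    def project_to_2d(joints_3d, focal_length=1.0, cx=0.0, cy=0.0):
        # Pinhole projection of (N, 3) camera-frame joint positions
        # onto the image plane, returning (N, 2) coordinates.
        x, y, z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
        return np.stack([focal_length * x / z + cx,
                         focal_length * y / z + cy], axis=1)

    def reprojection_error(joints_3d, keypoints_2d, confidences):
        # Confidence-weighted mean squared distance between the projected
        # model joints and the pose estimator's 2D keypoints.
        diff = project_to_2d(joints_3d) - keypoints_2d
        return np.sum(confidences * np.sum(diff ** 2, axis=1)) / np.sum(confidences)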

A spiral search pattern was employed here, the idea being that once the 3D model's initial positioning was established, each subsequent pose would be close to the prior one, being only a single video frame away.
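The write-up does not detail the spiral itself, but one plausible reading is an expanding search around the previous frame's solution that widens only while nearby candidates fail to improve the fit. A hypothetical sketch, with the ring count and step size as placeholder values:

    import numpy as np

    def spiral_search(prev_angles, objective, n_rings=8, samples_per_ring=16,
                      step=0.02, rng=None):
        # Try candidate joint-angle vectors in expanding "rings" around the
        # previous frame's solution, trying the nearest candidates first.
        rng = np.random.default_rng() if rng is None else rng
        best, best_err = prev_angles, objective(prev_angles)
        for ring in range(1, n_rings + 1):
            improved = False
            for _ in range(samples_per_ring):
                direction = rng.standard_normal(prev_angles.shape)
                direction /= np.linalg.norm(direction)
                candidate = prev_angles + ring * step * direction
                err = objective(candidate)
                if err < best_err:
                    best, best_err, improved = candidate, err, True
            if improved:
                break  # the next frame's pose is close by, so stay local
        return best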

2D Projection Ambiguity

There can be multiple 3D person models that project to the same 2D keypoints; the ambiguity is due to the lack of depth information in the 2D data. One way to obtain depth data is stereovision, so a rudimentary dual-iPhone setup was employed to produce stereo video. Each video stream was processed through the deep_hrnet pose estimator, producing left-eye and right-eye keypoints from which depth information can be calculated.

The cameras were 25 cm apart, not 25 mm as written on the newspaper.
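Given that 25 cm baseline, standard rectified-stereo triangulation (Z = f·B/d, where d is the horizontal pixel disparity between matching left-eye and right-eye keypoints) converts per-keypoint disparity into depth. A minimal sketch; the focal length here is a placeholder for a calibrated value:

    def depth_from_disparity(u_left, u_right, baseline_m=0.25, focal_px=1500.0):
        # Rectified-stereo triangulation: Z = f * B / d.
        # focal_px must come from camera calibration; 1500 is a placeholder.
        disparity = u_left - u_right
        if disparity <= 0:
            return float("inf")  # keypoint noise has swamped the disparity
        return focal_px * baseline_m / disparity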

While in theory the two keypoint sets provide the depth data needed to disambiguate the possible 3D models, in practice they are not quite accurate enough to do so: the jitter of the keypoints alone is often greater than the difference between the left-eye and right-eye data.

Smoothing the pose data revealed enough information to accurately model depth and produce a good model of the arms in this clip.
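The write-up does not name the smoothing filter used; a Savitzky-Golay filter over the keypoint time series is one reasonable choice, sketched here:

    from scipy.signal import savgol_filter

    def smooth_keypoints(keypoints, window=15, polyorder=3):
        # keypoints: (frames, joints, 2) array. Smoothing along the time
        # axis keeps frame-to-frame jitter from swamping the stereo disparity.
        return savgol_filter(keypoints, window_length=window,
                             polyorder=polyorder, axis=0)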

While (barely) good enough for detecting depth disparities of over 35 cm (14 inches), as above, the pose keypoint data is not accurate enough on its own to detect shoulder rotation. Augmenting it with elbow-angle and arm-bone-length heuristics in a voting system provided decent shoulder-rotation detection accuracy.
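A plausible shape for such a voting scheme is sketched below; the cue extraction, vote encoding, and threshold are illustrative assumptions rather than the project's exact logic:

    def bone_length_vote(observed_px, expected_px, threshold=0.8):
        # A strongly foreshortened upper arm suggests the shoulder is
        # rotated toward or away from the camera; 0.8 is an assumed cutoff.
        return 1 if observed_px < threshold * expected_px else 0

    def shoulder_rotation_detected(depth_vote, elbow_vote, length_vote):
        # Each weak cue votes 1 (rotation) or 0 (abstain); simple majority.
        return (depth_vote + elbow_vote + length_vote) >= 2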

Other pose estimation systems were evaluated, such as DEKR, AlphaPose, OpenPose, and Detectron2, but while good, none of them produced the keypoint accuracy required to resolve subtle (<30 cm) depth differences via stereovision at reasonable (<30 cm) left-eye/right-eye camera separations.

Neural Net 2D to 3D Systems

There are several systems that attempt to create 3D models of people from 2D images: VideoPose3D, XNect, HuMoR, and Hierarchical Kinematic Probability Distributions for 3D Human Shape.

VideoPose3D was evaluated (since it is a relatively lightweight installation) and produced good results. It too suffers from the lack of depth information, as shown in the clip below, but otherwise produces a usable model.

Miscellaneous Notes

  • Switched to using scipy.optimize algorithms to minimize an objective function that penalizes the error between the projected 3D-model keypoints and the 2D keypoints (a combined sketch follows this list).
  • Tried incorporating the feet pose estimates provided by OpenPose to help with leg orientation disambiguation, but it was of minimal help.
  • Detecting when the foot was planted on the ground or not and using that to restrict possible 3D poses did help.
  • Penalizing the edges of the ranges of motion for each body part also helped a little.
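A minimal sketch of how those pieces might combine into a single scipy.optimize fit: the reprojection error as the base objective, plus the range-of-motion and planted-foot penalties from the notes above. The model interface (forward kinematics, projection, foot index) and the weights are assumptions, not the project's actual code.

    import numpy as np
    from scipy.optimize import minimize

    def fit_pose(initial_angles, keypoints_2d, confidences, model, rom_limits,
                 planted_foot_pos=None, rom_weight=0.1, foot_weight=1.0):
        lo, hi = rom_limits  # per-joint angle limits
        margin = 0.05 * (hi - lo)  # penalize near, not just past, the edges

        def objective(angles):
            joints_3d = model.forward_kinematics(angles)    # hypothetical API
            diff = model.project(joints_3d) - keypoints_2d  # hypothetical API
            err = np.sum(confidences * np.sum(diff ** 2, axis=1))
            # Penalize joint angles approaching the edges of their range of motion.
            err += rom_weight * np.sum(
                np.maximum(0.0, (lo + margin) - angles) ** 2
                + np.maximum(0.0, angles - (hi - margin)) ** 2)
            # Pin a planted foot to its detected ground position.
            if planted_foot_pos is not None:
                err += foot_weight * np.sum(
                    (joints_3d[model.FOOT_IDX] - planted_foot_pos) ** 2)
            return err

        return minimize(objective, initial_angles, method="Powell").x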

Next Steps

2D-image to 3D person modeling is a very active research area, and while fascinating, it is not the primary value proposition of HelVision. VideoPose3D will therefore be used going forward, insofar as it can model the desired body-language cues, and swapped out as better systems become available or are required.

That said, LiDAR technologies have come down greatly in price and may warrant further investigation into their utility.