Pose Estimation and Preprocessing for an AI Fencing Referee

Jason Mo
4 min read · Nov 27, 2020

Update October 2021 — The work discussed in this blog has led to the introduction of Allez Go, the world’s first working AI fencing referee. Learn more in that blog post or at allzgo.com

In the last blog post, we collected our dataset of 2-second clips, each labeled either left touch or right touch. Unfortunately, we can’t feed this directly into a model to start training. Typically with action classification models, some sort of CNN is used on the raw video to extract features. However, since in our case we are looking for each fencer’s body features, we can use a pose estimation model as the feature extractor in place of a CNN.

As you can see, the pose estimator outputs the joints of each person fairly well. However, it estimates a pose for everyone in the frame, and we only want the poses of our 2 fencers. Luckily, it comes with a built-in parameter, number_people_max, which does what the name suggests: it keeps only the two poses the model is most confident in and outputs those.
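For reference, number_people_max is an OpenPose flag, so assuming OpenPose’s command-line interface, a run over one clip might look something like this sketch (the binary path, clip name, and output directory are illustrative):

```python
import subprocess

# Hypothetical path -- adjust to wherever the OpenPose binary actually lives.
OPENPOSE_BIN = "./build/examples/openpose/openpose.bin"

# Run the pose estimator on one clip, keeping only the two most
# confident poses per frame and writing one JSON file per frame.
subprocess.run([
    OPENPOSE_BIN,
    "--video", "clips/left_touch_0001.mp4",
    "--number_people_max", "2",   # keep only the two highest-confidence poses
    "--write_json", "output/left_touch_0001/",
    "--display", "0",             # no GUI
    "--render_pose", "0",         # skip rendering for speed
], check=True)
```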

You’re blocking my view!

It’s not a perfect workaround, since there are cases where the model will detect the wrong person, such as the referee. There’s no easy way to fix this short of training a fencer-only detection model from scratch, so for now we’ll just have to filter these clips out manually.

Other than outputting video files, the pose estimator can also output JSON files containing the coordinates of each pose, which is what we will use to train our model. It acts as a great feature extractor, removing the background and leaving only the fencers’ poses behind. Running pose estimation on our entire dataset takes about 24 hours. Now each clip is represented by 30 JSON files, one per frame. We’ll convert these to 30 lists to make things easier and convert them back to a NumPy array later.
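As a rough sketch of that loading step, assuming the OpenPose JSON layout, where each person’s 2-D keypoints are stored as a flat [x, y, confidence, ...] list (load_clip and the directory layout are illustrative):

```python
import json
from pathlib import Path

def load_clip(clip_dir):
    """Load one clip's 30 per-frame JSON files into a list of frames.

    Each frame becomes a list of people, and each person a list of
    (x, y, confidence) keypoint triples.
    """
    frames = []
    for json_path in sorted(Path(clip_dir).glob("*.json")):
        with open(json_path) as f:
            data = json.load(f)
        people = []
        for person in data["people"]:
            flat = person["pose_keypoints_2d"]   # [x0, y0, c0, x1, y1, c1, ...]
            keypoints = [flat[i:i + 3] for i in range(0, len(flat), 3)]
            people.append(keypoints)
        frames.append(people)
    return frames
```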

We’re not done yet, however. The pose estimator outputs a confidence value for each joint, which we don’t need, so we can throw those out. It also outputs feet and head joints which, while cool to look at, would likely cause the model to overfit, so we can throw those out too; the feet and head are not vital when determining right of way.

Another important factor to keep in mind is that for our model, order matters. The pose estimator doesn’t track people across frames, meaning that the left fencer might appear first in the list for one frame and second in the list for another. This doesn’t matter when we’re visualizing the data, but it will confuse the model. To fix this, we’ll just say that whichever pose is further to the left (has the smaller x-value) appears first in the list. We’ll do the same for each individual arm and leg.

Next, we’ll handle mistakes by the pose estimation model by removing outliers. We’ll use a general rule of thumb from statistics and say that any point more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile is an outlier and should be removed. We’ll also handle the occasional frame, like the example above, where someone other than the 2 fencers is detected: if the center of the fencers in the current frame is more than 60 pixels from the previous frame’s center, it is probably a misdetection. Finally, we’ll do our best to resize the fencers. We’ll first center them vertically in the frame using their median y-value and then scale them so that their height within the frame is consistent (scaling by target height / median height).
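Condensing those steps, a rough sketch might look like the following, assuming keypoints in OpenPose’s BODY_25 layout already loaded as NumPy arrays. The joint indices, frame height, and target height are illustrative, the per-limb ordering is omitted, and the scaling here uses the per-frame height rather than the median height across the clip:

```python
import numpy as np

# BODY_25 head and feet joint indices (assuming OpenPose's BODY_25 layout):
# 0 = nose, 15-18 = eyes/ears, 19-24 = toes/heels.
DROP_JOINTS = {0, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24}

def keep_body_joints(keypoints):
    """Drop the confidence column plus the head and feet joints."""
    kp = np.asarray(keypoints, dtype=float)      # (25, 3): x, y, confidence
    keep = [i for i in range(kp.shape[0]) if i not in DROP_JOINTS]
    return kp[keep, :2]                          # keep x, y only

def order_left_to_right(people):
    """Put the pose with the smaller median x first so order is stable."""
    return sorted(people, key=lambda p: np.median(p[:, 0]))

def remove_outliers(coords):
    """Mark points more than 1.5 * IQR beyond the 1st/3rd quartiles."""
    q1, q3 = np.percentile(coords, [25, 75])
    iqr = q3 - q1
    cleaned = coords.astype(float)
    cleaned[(coords < q1 - 1.5 * iqr) | (coords > q3 + 1.5 * iqr)] = np.nan
    return cleaned

def center_jumped(prev_people, curr_people, max_shift=60):
    """Flag frames where the fencers' combined center moves > 60 px."""
    prev_c = np.mean([np.median(p, axis=0) for p in prev_people], axis=0)
    curr_c = np.mean([np.median(p, axis=0) for p in curr_people], axis=0)
    return bool(np.linalg.norm(curr_c - prev_c) > max_shift)

def rescale(pose, frame_height=720, target_height=400):
    """Center a pose vertically, then scale it to a consistent height."""
    pose = pose.astype(float)
    pose[:, 1] += frame_height / 2 - np.median(pose[:, 1])
    height = pose[:, 1].max() - pose[:, 1].min()   # per-frame height here; the
    if height > 0:                                 # post uses the median height
        center = np.median(pose, axis=0)           # across the whole clip
        pose = center + (pose - center) * (target_height / height)
    return pose
```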

This is what our preprocessed data looks like when plotted with Matplotlib.

All of this preprocessing removes ‘noise’ from our data and allows our model to focus only on the fencers’ actions. This way, the model doesn’t have to account for factors like distance from the camera or outliers. Testing shows that this preprocessing lets us decrease the complexity of our model significantly.

One final addition to our data is when the individual lights go off. Using code similar to what we originally used to collect our data, we’ll detect when each light goes off and append a 1 or a 0 to each frame accordingly. Since we have no way of detecting the blade, this should make it easier for the model to differentiate between a miss and an actual hit on the fencer.
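The original light-detection code isn’t shown here, but as a minimal sketch of the idea, assuming the scoring-box lights sit in fixed regions of the frame and can be detected with a mean-intensity threshold (the regions and threshold below are made up for illustration):

```python
import numpy as np

# Illustrative pixel regions where the scoring-box lights appear in frame,
# as (row slice, column slice) pairs tuned by hand to the actual footage.
LEFT_LIGHT = (slice(20, 60), slice(100, 180))
RIGHT_LIGHT = (slice(20, 60), slice(460, 540))

def light_flags(frame, threshold=180):
    """Return (left_on, right_on) as 1/0 flags for one grayscale frame.

    A light counts as on when the mean pixel intensity of its region
    exceeds the threshold.
    """
    left = int(frame[LEFT_LIGHT].mean() > threshold)
    right = int(frame[RIGHT_LIGHT].mean() > threshold)
    return left, right
```

Each frame’s pose features then get these two flags appended before training.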

After that, we’ll normalize our data to fit between 0 and 1 and round it to 3 decimal places to prevent overfitting. Now we finally have data we can fit a model to, which will be coming up in the next blog post!
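A minimal sketch of that last step, assuming each clip is already a NumPy array; whether the min and max are taken per clip or over the whole dataset isn’t specified in the post, so per-clip scaling is used here:

```python
import numpy as np

def normalize_clip(clip, decimals=3):
    """Min-max scale a clip's features into [0, 1], rounded to 3 decimals."""
    lo, hi = clip.min(), clip.max()
    if hi == lo:                      # degenerate clip: avoid dividing by zero
        return np.zeros_like(clip, dtype=float)
    return np.round((clip - lo) / (hi - lo), decimals)
```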


Jason Mo

Founder of Allez Go, the world’s first AI fencing referee.