
I'm trying to use one of MediaPipe's pretrained TFLite models to perform pose landmark detection on Android (Java), which gives me information about 33 landmarks of a human body. I know there are other ways, for example using ML Kit, but using one of MediaPipe's models should give better results.

I want to use the Pose landmark model (https://google.github.io/mediapipe/solutions/models.html).

To use this model on Android it's necessary to know (and especially to understand) the output shapes of the model. These can be read in Java (or Python):

  • Five outputs: [195], [1], [256, 256, 1], [64, 64, 1], [117], if I read them correctly.

But in the model card the output array is defined as [33, 5], which makes sense because the goal is to detect 33 landmarks with 5 values each.

Can somebody explain the output shapes of the TFLite model and how to use them, or point me to documentation I missed? I sketched my current guess below.
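My current guess (not verified) is that the [195] output might simply be a flattened array of 39 points × 5 values (x, y, z, visibility, presence), i.e. the 33 documented landmarks plus some auxiliary points, which would line up with the [33, 5] from the model card; the [117] output could then be 39 × 3 world coordinates. A minimal decoding sketch under that assumption (all names are my own):

    import org.tensorflow.lite.support.tensorbuffer.TensorBuffer;

    // Hypothetical decoding: interpret the flat buffer as rows of
    // (x, y, z, visibility, presence) per landmark. Unverified guess.
    private float[][] toLandmarkRows(TensorBuffer buffer) {
        float[] raw = buffer.getFloatArray();            // e.g. length 195
        int valuesPerLandmark = 5;
        int numPoints = raw.length / valuesPerLandmark;  // 195 / 5 = 39?
        float[][] rows = new float[numPoints][valuesPerLandmark];
        for (int i = 0; i < numPoints; i++) {
            for (int j = 0; j < valuesPerLandmark; j++) {
                rows[i][j] = raw[i * valuesPerLandmark + j];
            }
        }
        return rows; // presumably only the first 33 rows are the documented landmarks
    }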

The code used is auto-generated by Android Studio and offered by the "sample code" section of the model in the ml folder (following these instructions: https://www.tensorflow.org/lite/inference_with_metadata/codegen#mlbinding; I get the same shapes when using the following approach instead: https://www.tensorflow.org/lite/guide/inference#load_and_run_a_model_in_java):

import java.io.IOException;
import org.tensorflow.lite.DataType;
import org.tensorflow.lite.support.tensorbuffer.TensorBuffer;
// PoseDetection is the class Android Studio generated from the .tflite model in the ml folder.

try {
    PoseDetection model = PoseDetection.newInstance(getApplicationContext());

    // Creates inputs for reference.
    TensorBuffer inputFeature0 = TensorBuffer.createFixedSize(new int[]{1, 224, 224, 3}, DataType.FLOAT32);

    // imageBuffer is the image as ByteBuffer
    inputFeature0.loadBuffer(imageBuffer);

    // Runs model inference and gets result.
    PoseDetection.Outputs outputs = model.process(inputFeature0);
    TensorBuffer outputFeature0 = outputs.getOutputFeature0AsTensorBuffer();
    TensorBuffer outputFeature1 = outputs.getOutputFeature1AsTensorBuffer();
    TensorBuffer outputFeature2 = outputs.getOutputFeature2AsTensorBuffer();
    TensorBuffer outputFeature3 = outputs.getOutputFeature3AsTensorBuffer();
    TensorBuffer outputFeature4 = outputs.getOutputFeature4AsTensorBuffer();

    // Releases model resources if no longer used.
    model.close();
} catch (IOException e) {
    // TODO Handle the exception
}
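
For reference, one way to build such an imageBuffer with the TFLite support library could look roughly like this (the 224×224 resize matches the input tensor above; the normalization to [0, 1] is an assumption on my part, I'm not sure what the model expects):

    import android.graphics.Bitmap;
    import java.nio.ByteBuffer;
    import org.tensorflow.lite.DataType;
    import org.tensorflow.lite.support.common.ops.NormalizeOp;
    import org.tensorflow.lite.support.image.ImageProcessor;
    import org.tensorflow.lite.support.image.TensorImage;
    import org.tensorflow.lite.support.image.ops.ResizeOp;

    // Resizes the bitmap to the model's input size and returns it as a FLOAT32 buffer.
    private ByteBuffer toImageBuffer(Bitmap bitmap) {
        ImageProcessor imageProcessor = new ImageProcessor.Builder()
                .add(new ResizeOp(224, 224, ResizeOp.ResizeMethod.BILINEAR))
                .add(new NormalizeOp(0f, 255f)) // assumption: pixel values scaled to [0, 1]
                .build();

        TensorImage tensorImage = new TensorImage(DataType.FLOAT32);
        tensorImage.load(bitmap);
        tensorImage = imageProcessor.process(tensorImage);
        return tensorImage.getBuffer();
    }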

I got the shapes by inspecting the outputFeatures using the debugger.
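The shapes can also be read programmatically instead of via the debugger, e.g. inside the try block above (needs android.util.Log and java.util.Arrays):

    // Logs shape and data type of the first two output buffers; same pattern for the rest.
    Log.d("PoseDetection", "out0: " + Arrays.toString(outputFeature0.getShape()) + " " + outputFeature0.getDataType());
    Log.d("PoseDetection", "out1: " + Arrays.toString(outputFeature1.getShape()) + " " + outputFeature1.getDataType());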

Thanks a lot.

  • Welcome to StackOverflow. Without actually showing your code (& possible source image) or providing more detail (i.e. which example/sample project used) it is difficult to know where you are getting your output values. Just by guessing, the first set is for the bounding box and the second set is the Vitruvian man position as described in docs with `we predict the midpoint of a person’s hips, the radius of a circle circumscribing the whole person, and the incline angle of the line connecting the shoulder and hip midpoints.` – Morrison Chang Jan 21 '23 at 01:52
  • I edited the question a bit, offering some code (the input image shouldn't matter, because the output shapes are predefined). Which docs are you referring to? – EinePriseCode Jan 21 '23 at 10:58
  • My quote is from: https://google.github.io/mediapipe/solutions/pose.html#models Do note there is an entire section in the mediapipe documentation for getting started on Android: https://google.github.io/mediapipe/getting_started/android.html – Morrison Chang Jan 21 '23 at 20:49

1 Answer


After some consideration, prompted by the hint in MediaPipe's documentation "Disclaimer: Running MediaPipe on Windows is experimental." (https://developers.google.com/mediapipe/framework/getting_started/install#installing_on_windows), I followed the instructions on google.github.io/mediapipe/getting_started/android.html as @Morrison Chang proposed. This approach takes a lot of time to understand, but it offers high customizability and good results. That solved my problem; the old approach does not seem suitable.
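
For anyone taking the same route: in the example apps from that guide, the landmarks arrive as a NormalizedLandmarkList proto on an output stream of the graph. A rough sketch of how they can be read (the stream name "pose_landmarks" and the FrameProcessor setup are taken from the pose example and may differ for your own graph):

    import com.google.mediapipe.formats.proto.LandmarkProto.NormalizedLandmarkList;
    import com.google.mediapipe.framework.PacketGetter;

    // processor is the FrameProcessor set up as in the example app; read the pose
    // landmarks from the graph's output stream (stream name from the pose example).
    processor.addPacketCallback(
            "pose_landmarks",
            (packet) -> {
                try {
                    NormalizedLandmarkList landmarks =
                            NormalizedLandmarkList.parseFrom(PacketGetter.getProtoBytes(packet));
                    // 33 landmarks, each with normalized x/y/z and visibility
                    float noseX = landmarks.getLandmark(0).getX();
                } catch (Exception e) {
                    // handle parse errors
                }
            });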