
I am using YOLOv8 for object detection in a React app, and I'm having trouble interpreting the output of the model. I have taken the official "yolov8n.pt" model from Ultralytics and converted it to a TensorFlow.js web model in Python like this:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # load the official model
model.export(format="tfjs")  # export the model to TensorFlow.js format

I have added the resulting model.json to the public folder of my React app. Then I preprocess an image and run a prediction with the model like this:

import * as tf from '@tensorflow/tfjs';
import { loadGraphModel } from '@tensorflow/tfjs';

const imagePrediction = async () => {
  const model = await loadGraphModel('/yolon-model/model.json');

  const image = document.getElementsByClassName('catImage')[0];
  // The exported YOLOv8 model expects float input normalized to [0, 1], not ints
  const tfimg = tf.browser.fromPixels(image).toFloat().div(255);
  const tfimg_res = tf.image.resizeBilinear(tfimg, [640, 640]);
  const expandedimg = tfimg_res.expandDims(0); // add the batch dimension -> [1, 640, 640, 3]

  const predictions = await model.executeAsync(expandedimg);

  console.log(predictions);
};

The output tensor is of shape 1x84x8400, and there are 80 classes in the official yolov8n model. I'm not sure how to interpret this output and how to extract the bounding boxes and their corresponding classes. Can someone please help me with this? Thank you!

Marcus

2 Answers


Not a complete answer but a start: The output tensor hasn't yet been run through NMS (Non-Maximum Suppression), which YOLO usually does automatically. This script seems to load an ONNX model (which has the same output format) and convert the results into boxes with associated scores:

https://github.com/ultralytics/ultralytics/blob/main/examples/YOLOv8-OpenCV-ONNX-Python/main.py

So you should just need to perform the same operations on the output tensor as this script does. I'll try it myself later to verify that it works.

Note that this script depends on OpenCV to perform the transposition and NMS.
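For a browser-side equivalent, here is a minimal tfjs sketch of the same decode-and-NMS steps. It assumes the tfjs export keeps the same layout as the ONNX output: each of the 8400 columns is [cx, cy, w, h] in 640x640 input pixels followed by 80 class scores (YOLOv8 has no separate objectness term). The helper name decodePredictions and the threshold values are my own choices, not from the script:

import * as tf from '@tensorflow/tfjs';

// predictions: the [1, 84, 8400] tensor returned by model.executeAsync()
const decodePredictions = async (predictions, scoreThreshold = 0.25, iouThreshold = 0.45) => {
  // [1, 84, 8400] -> [8400, 84]: one row per candidate box (the script's transpose step)
  const rows = await predictions.squeeze([0]).transpose([1, 0]).array();

  const boxes = [];   // [y1, x1, y2, x2], the layout tf.js NMS expects
  const scores = [];
  const classes = [];
  for (const row of rows) {
    const [cx, cy, w, h] = row;
    const classScores = row.slice(4); // 80 class scores, no objectness term in YOLOv8
    let best = 0;
    for (let i = 1; i < classScores.length; i++) {
      if (classScores[i] > classScores[best]) best = i;
    }
    if (classScores[best] < scoreThreshold) continue;
    boxes.push([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2]); // xywh -> corner format
    scores.push(classScores[best]);
    classes.push(best);
  }
  if (boxes.length === 0) return [];

  // tf.image.nonMaxSuppressionAsync stands in for OpenCV's NMSBoxes call
  const keep = await tf.image.nonMaxSuppressionAsync(
    tf.tensor2d(boxes), tf.tensor1d(scores), 100, iouThreshold, scoreThreshold
  );
  const keptIndices = await keep.array();
  return keptIndices.map((i) => ({ box: boxes[i], score: scores[i], classId: classes[i] }));
};

The surviving boxes are still in 640x640 input coordinates, so scale them back to your original image size before drawing. Also note this NMS is class-agnostic, like the Python example.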

Amber Cahill

Please follow this article: https://hackernoon.com/deploy-computer-vision-models-with-triton-inference-server

For YOLOv5:

  • The output dims are [25200, 85], matching the YOLOv5 output shape.
  • The model outputs 4 box coordinates + 1 object confidence + 80 class scores at each anchor, and there are 25200 anchors per image. So if you have, for example, 4 classes to detect, you should change 85 to 9; see the sketch after this list.
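To make that layout concrete, here is a minimal sketch of scoring one row of the [25200, 85] YOLOv5 output (the function name scoreYolov5Row is hypothetical). Unlike YOLOv8, YOLOv5 carries a separate objectness confidence that is multiplied with the class score:

// One row of the [25200, 85] YOLOv5 output:
// [cx, cy, w, h, objectness, ...80 class scores]
const scoreYolov5Row = (row, scoreThreshold = 0.25) => {
  const [cx, cy, w, h, objectness] = row;
  const classScores = row.slice(5);

  let bestClass = 0;
  for (let i = 1; i < classScores.length; i++) {
    if (classScores[i] > classScores[bestClass]) bestClass = i;
  }

  // YOLOv5 confidence = objectness * class score (YOLOv8 drops the objectness term)
  const confidence = objectness * classScores[bestClass];
  if (confidence < scoreThreshold) return null;

  return { box: [cx - w / 2, cy - h / 2, w, h], classId: bestClass, confidence };
};

With 4 classes each row would be 9 wide (4 box + 1 objectness + 4 class scores), matching the change from 85 to 9 described above.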