I compiled the example.

https://developer.apple.com/documentation/vision/recognizing_objects_in_live_capture

It did not work correctly for me on an iPhone 7 Plus. The rectangles drawn did not cover the items detected.

I created an app of my own to investigate. The detected objects are returned as normalised bounding boxes. However, the bounds can be negative in the Y direction, and adding a correction of 0.2 brings them back into alignment.

The detection appears to run on a square cropped from the center of the portrait frame. I created a square overlay, and when the object moves out of the square at the top or bottom, detection stops. The top and bottom of that square correspond to 0 and 1.0 in the normalised coordinates.

The test app passes the data from captureOutput to a VNImageRequestHandler. The code that sets up the request is also below. Any idea why the observations are sometimes negative in the Y direction? Why do I need to add an offset to bring them back into the unit square and align them with the image?

I have set the camera to 4K in my test app; I have not yet tried any other settings.
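
For reference, the capture side of the test app is set up roughly like this (a minimal sketch rather than the exact code; the property and queue names are placeholders):

    // Rough sketch of the 4K capture setup used in the test app (names are illustrative).
    let session = AVCaptureSession()

    func setupCaptureSession() {
        session.sessionPreset = .hd4K3840x2160

        if let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
           let input = try? AVCaptureDeviceInput(device: device),
           session.canAddInput(input) {
            session.addInput(input)
        }

        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.alwaysDiscardsLateVideoFrames = true
        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "VideoDataOutput"))
        if session.canAddOutput(videoOutput) {
            session.addOutput(videoOutput)
        }
        session.startRunning()
    }

Each frame is then delivered to the delegate callback below: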

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }

        //let exifOrientation = exifOrientationFromDeviceOrientation()
        let exifOrientation = CGImagePropertyOrientation.up
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: exifOrientation, options: [:])
        do {
            try imageRequestHandler.perform(self.requests)
        } catch {
            print(error)
        }
    }
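
The commented-out exifOrientationFromDeviceOrientation() call refers to the orientation helper from Apple's sample; from memory it maps the device orientation roughly like this (treat the exact cases as an assumption):

    // Sketch of the helper commented out above, based on Apple's sample code.
    func exifOrientationFromDeviceOrientation() -> CGImagePropertyOrientation {
        switch UIDevice.current.orientation {
        case .portraitUpsideDown: return .left        // home button at the top
        case .landscapeLeft:      return .upMirrored  // home button on the right
        case .landscapeRight:     return .down        // home button on the left
        default:                  return .up          // portrait and unknown orientations
        }
    }

For now the test app bypasses it and hard-codes .up. The request setup, from setupVision(), is:
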
    @discardableResult
    func setupVision() -> NSError? {
        // Setup Vision parts
        let error: NSError! = nil

        guard let modelURL = Bundle.main.url(forResource: "ResistorModel", withExtension: "mlmodelc") else {
            return NSError(domain: "VisionObjectRecognitionViewController", code: -1, userInfo: [NSLocalizedDescriptionKey: "Model file is missing"])
        }
        do {
            let visionModel = try VNCoreMLModel(for: MLModel(contentsOf: modelURL))
            let objectRecognition = VNCoreMLRequest(model: visionModel, completionHandler: { (request, error) in
                DispatchQueue.main.async(execute: {
                    // perform all the UI updates on the main queue
                    if let results = request.results {
                        self.drawVisionRequestResults(results)
                    }
                })
            })
            self.requests = [objectRecognition]
        } catch let error as NSError {
            print("Model loading went wrong: \(error)")
        }

        return error
    }


    func drawVisionRequestResults(_ results: [Any]) {
        var pipCreated = false
        CATransaction.begin()
        CATransaction.setValue(kCFBooleanTrue, forKey: kCATransactionDisableActions)
        detectionOverlay.sublayers = nil // remove all the old recognized objects
        for observation in results where observation is VNRecognizedObjectObservation {
            guard let objectObservation = observation as? VNRecognizedObjectObservation else {
                continue
            }
            // Select only the label with the highest confidence.
            let topLabelObservation = objectObservation.labels[0]
            if topLabelObservation.identifier == "resistor" {
                // Only handle a detection whose box straddles the point (0.5, 0.3).
                if (objectObservation.boundingBox.minX < 0.5) && (objectObservation.boundingBox.maxX > 0.5) &&
                    (objectObservation.boundingBox.minY < 0.3) && (objectObservation.boundingBox.maxY > 0.3) {
                    // 0.8 = 1.0 - 0.2: flip Y for the layer coordinates after adding
                    // the arbitrary 0.2 correction to the observation.
                    let bb = CGRect(x: objectObservation.boundingBox.minX,
                                    y: 0.8 - objectObservation.boundingBox.maxY,
                                    width: objectObservation.boundingBox.width,
                                    height: objectObservation.boundingBox.height)
                    //let bb = CGRect(x: 0.5, y: 0.5, width: 0.5, height: 0.5)
                    //let objectBounds = VNImageRectForNormalizedRect(bb, 500, 500)
                    let objectBounds = VNImageRectForNormalizedRect(bb, Int(detectionOverlay.bounds.width), Int(detectionOverlay.bounds.width))

                    print(objectObservation.boundingBox)

                    let textLayer = self.createTextSubLayerInBounds(objectBounds,
                                                                    identifier: topLabelObservation.identifier,
                                                                    confidence: topLabelObservation.confidence)

                    let shapeLayer = self.createRoundedRectLayerWithBounds(objectBounds)

                    shapeLayer.addSublayer(textLayer)
                    detectionOverlay.addSublayer(shapeLayer)

                    if !pipCreated {
                        pipCreated = true
                        let pip = Pip(imageBuffer: self.imageBuffer!)
                        if self.pip {
                            pipView.image = pip?.uiImage
                        } else {
                            pipView.image = nil
                        }
                    }
                }
            }
        }
        CATransaction.commit()
        doingStuff = false
    }
  • Did you create your own model using turicreate? Can you show the `drawVisionRequests` code? Have you tested your code using coremltools in Python with the same images, to see whether it is Vision or the model returning the negative y coordinates? You could also try using Core ML directly instead of the Vision wrapper to test that last idea. – ɯɐɹʞ May 02 '19 at 17:36
  • The model I am using was created with turiCreate, and one also came with the original sample. I must admit I have not tested whether the fault with the original sample is exactly the same issue; however, it did not work correctly for me out of the box. As part of the app development I will be adding code to take an image, so I'll let you know how I get on. I can also load the model into the macOS app "RectLabel" and it appears to be lined up correctly. – William J Bagshaw May 02 '19 at 18:38
  • This probably happens because the model works on a partial crop of the original input image, and so the results are relative to that crop. You need to do your own math to convert it back to screen coordinates (because that totally depends on how your app does things). As for negative x/y coordinates on the predicted bounding box, that can happen; it just means the model thinks the center of the detected object is off-screen. – Matthijs Hollemans May 02 '19 at 18:57
  • @MatthijsHollemans I'm adding an arbitrary constant value to correct the negativeness. I tried objects of different heights and the offset seems to be constant, so it is not due to a confusion between corner/center coords. I'm adding 0.2 for no obvious reason. The actual objects being detected are from the center square crop. I know this because objects outside this region are not detected, and objects that straddle the crop are only as high as the part inside the crop. So it's not detecting outside the center crop region, yet I'm getting offset normalised values that can be negative. Where has the 0.2 come from? – William J Bagshaw May 02 '19 at 19:17
  • That 0.2 is probably because the way you're displaying the results on the screen uses a different coordinate system and/or aspect ratio than the image your model is working on. Also note that the images coming from the camera have their own coordinate system and aspect ratio. You need to properly translate between these three things (camera image, Core ML image, display image). – Matthijs Hollemans May 03 '19 at 08:03
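
To make the translation described in the last comment concrete, here is what I would expect the mapping to look like if Vision were returning boxes relative to a center-square crop of the portrait buffer rather than to the full frame (this is a hypothesis, not confirmed behaviour; the function name and numbers are illustrative):

    // Hypothetical: map a rect normalised to the center square crop of a portrait
    // buffer back to coordinates normalised to the full buffer.
    func cropNormalizedToFullFrame(_ r: CGRect, bufferWidth w: CGFloat, bufferHeight h: CGFloat) -> CGRect {
        let side = min(w, h)             // the center square is side x side
        let yOffset = (h - side) / 2     // pixels above and below the crop
        return CGRect(x: r.origin.x * side / w,
                      y: (r.origin.y * side + yOffset) / h,
                      width: r.width * side / w,
                      height: r.height * side / h)
    }

For a 4K portrait buffer (2160 × 3840) that offset is 840 / 3840 ≈ 0.22, which is at least in the same ballpark as the empirical 0.2 correction.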

1 Answer

I'm not sure why it behaved as it did. However, I would like it to have used the whole image for the object detection and to have returned bounding boxes normalised to the original portrait input. Note also that the model was trained in this way.

There is a thread, https://github.com/apple/turicreate/issues/1016, covering this exact issue. The example does not work as supplied, and it still does not work when you change the model.

The solution, towards the end of that thread, is to set:

    objectRecognition.imageCropAndScaleOption = .scaleFill

This made the detection use the whole image and produced bounding boxes that were normalised to the whole image, with no more arbitrary offset. It may be that the training geometry and the detection geometry have to match for the bounding box to be calculated correctly, although I'm not sure why.
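
For context, this is where that line goes in the setupVision() code from the question; everything else is unchanged and only the crop-and-scale line is the fix:

    do {
        let visionModel = try VNCoreMLModel(for: MLModel(contentsOf: modelURL))
        let objectRecognition = VNCoreMLRequest(model: visionModel, completionHandler: { (request, error) in
            DispatchQueue.main.async(execute: {
                // perform all the UI updates on the main queue
                if let results = request.results {
                    self.drawVisionRequestResults(results)
                }
            })
        })
        // Use the whole frame, scaled to the model's input size, rather than a
        // cropped region; this removed the arbitrary offset.
        objectRecognition.imageCropAndScaleOption = .scaleFill
        self.requests = [objectRecognition]
    } catch let error as NSError {
        print("Model loading went wrong: \(error)")
    }

If the default crop-and-scale option is indeed a center crop, that would also fit the center-square behaviour described in the question, although I have not checked the documentation to confirm it.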