Converting a Vision VNTextObservation to a String

Question

I'm looking through the Apple's Vision API documentation and I see a couple of classes that relate to text detection in UIImages:

1) class VNDetectTextRectanglesRequest

2) class VNTextObservation

It looks like they can detect characters, but I don't see a means to do anything with the characters. Once you've got characters detected, how would you go about turning them into something that can be interpreted by NSLinguisticTagger?

Here's a post that is a brief overview of Vision.

Thank you for reading.

@ZaidPathan Not yet. These classes seem like they have a ton of potential if they can tie this end. — Adrian, Aug 19 '17 at 13:15
Having difficulties to even detect the words on an ID card (e.g. passport) with 100% accuracy. Couldn't imagine how difficult it would be to convert it to text =.='' — user1872384, Jan 22 '18 at 08:03
Looks like this is coming in iOS 13! https://www.hackingwithswift.com/example-code/vision/how-to-use-vnrecognizetextrequests-optical-character-recognition-to-detect-text-in-an-image — Adrian, Jun 07 '19 at 16:51

score 15 · Answer 1 · edited Dec 27 '17 at 08:56

This is how to do it ...

    //
//  ViewController.swift
//


import UIKit
import Vision
import CoreML

class ViewController: UIViewController {

    //HOLDS OUR INPUT
    var  inputImage:CIImage?

    //RESULT FROM OVERALL RECOGNITION
    var  recognizedWords:[String] = [String]()

    //RESULT FROM RECOGNITION
    var recognizedRegion:String = String()


    //OCR-REQUEST
    lazy var ocrRequest: VNCoreMLRequest = {
        do {
            //THIS MODEL IS TRAINED BY ME FOR FONT "Inconsolata" (Numbers 0...9 and UpperCase Characters A..Z)
            let model = try VNCoreMLModel(for:OCR().model)
            return VNCoreMLRequest(model: model, completionHandler: self.handleClassification)
        } catch {
            fatalError("cannot load model")
        }
    }()

    //OCR-HANDLER
    func handleClassification(request: VNRequest, error: Error?)
    {
        guard let observations = request.results as? [VNClassificationObservation]
            else {fatalError("unexpected result") }
        guard let best = observations.first
            else { fatalError("cant get best result")}

        self.recognizedRegion = self.recognizedRegion.appending(best.identifier)
    }

    //TEXT-DETECTION-REQUEST
    lazy var textDetectionRequest: VNDetectTextRectanglesRequest = {
        return VNDetectTextRectanglesRequest(completionHandler: self.handleDetection)
    }()

    //TEXT-DETECTION-HANDLER
    func handleDetection(request:VNRequest, error: Error?)
    {
        guard let observations = request.results as? [VNTextObservation]
            else {fatalError("unexpected result") }

       // EMPTY THE RESULTS
        self.recognizedWords = [String]()

        //NEEDED BECAUSE OF DIFFERENT SCALES
        let  transform = CGAffineTransform.identity.scaledBy(x: (self.inputImage?.extent.size.width)!, y:  (self.inputImage?.extent.size.height)!)

        //A REGION IS LIKE A "WORD"
        for region:VNTextObservation in observations
        {
            guard let boxesIn = region.characterBoxes else {
                continue
            }

            //EMPTY THE RESULT FOR REGION
            self.recognizedRegion = ""

            //A "BOX" IS THE POSITION IN THE ORIGINAL IMAGE (SCALED FROM 0... 1.0)
            for box in boxesIn
            {
                //SCALE THE BOUNDING BOX TO PIXELS
                let realBoundingBox = box.boundingBox.applying(transform)

                //TO BE SURE
                guard (inputImage?.extent.contains(realBoundingBox))!
                    else { print("invalid detected rectangle"); return}

                //SCALE THE POINTS TO PIXELS
                let topleft = box.topLeft.applying(transform)
                let topright = box.topRight.applying(transform)
                let bottomleft = box.bottomLeft.applying(transform)
                let bottomright = box.bottomRight.applying(transform)

                //LET'S CROP AND RECTIFY
                let charImage = inputImage?
                    .cropped(to: realBoundingBox)
                    .applyingFilter("CIPerspectiveCorrection", parameters: [
                        "inputTopLeft" : CIVector(cgPoint: topleft),
                        "inputTopRight" : CIVector(cgPoint: topright),
                        "inputBottomLeft" : CIVector(cgPoint: bottomleft),
                        "inputBottomRight" : CIVector(cgPoint: bottomright)
                        ])

                //PREPARE THE HANDLER
                let handler = VNImageRequestHandler(ciImage: charImage!, options: [:])

                //SOME OPTIONS (TO PLAY WITH..)
                self.ocrRequest.imageCropAndScaleOption = VNImageCropAndScaleOption.scaleFill

                //FEED THE CHAR-IMAGE TO OUR OCR-REQUEST - NO NEED TO SCALE IT - VISION WILL DO IT FOR US !!
                do {
                    try handler.perform([self.ocrRequest])
                }  catch { print("Error")}

            }

            //APPEND RECOGNIZED CHARS FOR THAT REGION
            self.recognizedWords.append(recognizedRegion)
        }

        //THATS WHAT WE WANT - PRINT WORDS TO CONSOLE
        DispatchQueue.main.async {
            self.PrintWords(words: self.recognizedWords)
        }
    }

    func PrintWords(words:[String])
    {
        // VOILA'
        print(recognizedWords)

    }

    func doOCR(ciImage:CIImage)
    {
        //PREPARE THE HANDLER
        let handler = VNImageRequestHandler(ciImage: ciImage, options:[:])

        //WE NEED A BOX FOR EACH DETECTED CHARACTER
        self.textDetectionRequest.reportCharacterBoxes = true
        self.textDetectionRequest.preferBackgroundProcessing = false

        //FEED IT TO THE QUEUE FOR TEXT-DETECTION
        DispatchQueue.global(qos: .userInteractive).async {
            do {
                try  handler.perform([self.textDetectionRequest])
            } catch {
                print ("Error")
            }
        }

    }

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.

        //LETS LOAD AN IMAGE FROM RESOURCE
        let loadedImage:UIImage = UIImage(named: "Sample1.png")! //TRY Sample2, Sample3 too

        //WE NEED A CIIMAGE - NOT NEEDED TO SCALE
        inputImage = CIImage(image:loadedImage)!

        //LET'S DO IT
        self.doOCR(ciImage: inputImage!)


    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }
}

You'll find the complete project here included is the trained model !

How did you trained the model? TensorFlor? – ricardopereira May 14 '18 at 08:47 — ricardopereira, May 14 '18 at 08:47

score 15 · Accepted Answer · answered Jun 11 '19 at 04:22

Apple finally updated Vision to do OCR. Open a playground and dump a couple of test images in the Resources folder. In my case, I called them "demoDocument.jpg" and "demoLicensePlate.jpg".

The new class is called VNRecognizeTextRequest. Dump this in a playground and give it a whirl:

import Vision

enum DemoImage: String {
    case document = "demoDocument"
    case licensePlate = "demoLicensePlate"
}

class OCRReader {
    func performOCR(on url: URL?, recognitionLevel: VNRequestTextRecognitionLevel)  {
        guard let url = url else { return }
        let requestHandler = VNImageRequestHandler(url: url, options: [:])

        let request = VNRecognizeTextRequest  { (request, error) in
            if let error = error {
                print(error)
                return
            }

            guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

            for currentObservation in observations {
                let topCandidate = currentObservation.topCandidates(1)
                if let recognizedText = topCandidate.first {
                    print(recognizedText.string)
                }
            }
        }
        request.recognitionLevel = recognitionLevel

        try? requestHandler.perform([request])
    }
}

func url(for image: DemoImage) -> URL? {
    return Bundle.main.url(forResource: image.rawValue, withExtension: "jpg")
}

let ocrReader = OCRReader()
ocrReader.performOCR(on: url(for: .document), recognitionLevel: .fast)

There's an in-depth discussion of this from WWDC19

Note: This class is available for iOS 13.0 Beta+ – Newsonic Aug 04 '19 at 19:05 — Newsonic, Aug 04 '19 at 19:05
It only supports en-US currently. – NSDeveloper Dec 30 '19 at 08:17 — NSDeveloper, Dec 30 '19 at 08:17

brian.clear · Answer 3 · 2017-06-14T16:58:04.987

13

SwiftOCR

I just got SwiftOCR to work with small sets of text.

https://github.com/garnele007/SwiftOCR

uses

https://github.com/Swift-AI/Swift-AI

which uses NeuralNet-MNIST model for text recognition.

TODO : VNTextObservation > SwiftOCR

Will post example of it using VNTextObservation once I have it one connected to the other.

OpenCV + Tesseract OCR

I tried to use OpenCV + Tesseract but got compile errors then found SwiftOCR.

SEE ALSO : Google Vision iOS

Note Google Vision Text Recognition - Android sdk has text detection but also has iOS cocoapod. So keep an eye on it as should add text recognition to the iOS eventually.

https://developers.google.com/vision/text-overview

//Correction: just tried it but only Android version of the sdk supports text detection.

https://developers.google.com/vision/text-overview

If you subscribe to releases: https://libraries.io/cocoapods/GoogleMobileVision

Click SUBSCRIBE TO RELEASES you can see when TextDetection is added to the iOS part of the Cocoapod

edited Jun 14 '17 at 16:58

answered Jun 14 '17 at 14:48

brian.clear

5,277
2
41
62

Haven't had a chance to mess with it, but I think you're onto something. You could just grab the rect and OCR the sub-image. https://stackoverflow.com/a/42497332/4475605 – Adrian Jun 15 '17 at 03:27
the Google Vision OCR is in beta only accessible over REST from ios. not included in the IOS SDK. https://cloud.google.com/vision/docs/ocr – brian.clear Jun 21 '17 at 12:37
Were you able to get VNTextObservation connected up with SwiftOCR? – Jordan H Aug 03 '17 at 00:25
i see MS Cognitive services can now read text in images https://azure.microsoft.com/en-gb/services/cognitive-services/computer-vision/ – brian.clear Aug 08 '17 at 11:48
Ok, so I was just looking into this and it seem iOS support is available now: https://developers.google.com/vision/ios/text-overview – Hammad Tariq Apr 29 '18 at 08:45

score 10 · Answer 4 · answered Jun 19 '17 at 08:41

Adding my own progress on this, if anyone have a better solution:

I've successfully drawn the region box and character boxes on screen. The vision API of Apple is actually very performant. You have to transform each frame of your video to an image and feed it to the recogniser. It's much more accurate than feeding directly the pixel buffer from the camera.

 if #available(iOS 11.0, *) {
            guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {return}

            var requestOptions:[VNImageOption : Any] = [:]

            if let camData = CMGetAttachment(sampleBuffer, kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, nil) {
                requestOptions = [.cameraIntrinsics:camData]
            }

            let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                                            orientation: 6,
                                                            options: requestOptions)

            let request = VNDetectTextRectanglesRequest(completionHandler: { (request, _) in
                guard let observations = request.results else {print("no result"); return}
                let result = observations.map({$0 as? VNTextObservation})
                DispatchQueue.main.async {
                    self.previewLayer.sublayers?.removeSubrange(1...)
                    for region in result {
                        guard let rg = region else {continue}
                        self.drawRegionBox(box: rg)
                        if let boxes = region?.characterBoxes {
                            for characterBox in boxes {
                                self.drawTextBox(box: characterBox)
                            }
                        }
                    }
                }
            })
            request.reportCharacterBoxes = true
            try? imageRequestHandler.perform([request])
        }
    }

Now I'm trying to actually reconize the text. Apple doesn't provide any built in OCR model. And I want to use CoreML to do that, so I'm trying to convert a Tesseract trained data model to CoreML.

You can find Tesseract models here: https://github.com/tesseract-ocr/tessdata and I think the next step is to write a coremltools converter that support those type of input and output a .coreML file.

Or, you can link to TesseractiOS directly and try to feed it with your region boxes and character boxes you get from the Vision API.

Did you have any success converting Tesseract to a Core ML model? — user3746428, Jul 30 '17 at 22:55
Any progress on this? I was looking into this, and I might just end up using vision api to find the characters, and then somehow feed that to the tesseract iOS SDK. I would rather use coreml for the support/quickness, but I might have to settle — Tyler Kelly, Aug 10 '17 at 15:17
with this code how could I have a rectangle box in the centre of the screen and only text in that area gets detected with a box around it? anything outside the rectangle doesn't have a box? — Tony Merritt, Sep 07 '17 at 22:35
Any progress on detecting actual characters/string in rectangle boxes with CoreML? Thanks — Libor Zapletal, Oct 10 '17 at 14:41
Hi there! Are there any build in tools to detect actual characters / strings in iOS 12 with CoreML 2? — user3191334, Aug 14 '18 at 19:06

nathan · Answer 5 · 2017-06-14T07:23:23.617

Thanks to a GitHub user, you can test an example: https://gist.github.com/Koze/e59fa3098388265e578dee6b3ce89dd8

- (void)detectWithImageURL:(NSURL *)URL
{
    VNImageRequestHandler *handler = [[VNImageRequestHandler alloc] initWithURL:URL options:@{}];
    VNDetectTextRectanglesRequest *request = [[VNDetectTextRectanglesRequest alloc] initWithCompletionHandler:^(VNRequest * _Nonnull request, NSError * _Nullable error) {
        if (error) {
            NSLog(@"%@", error);
        }
        else {
            for (VNTextObservation *textObservation in request.results) {
//                NSLog(@"%@", textObservation);
//                NSLog(@"%@", textObservation.characterBoxes);
                NSLog(@"%@", NSStringFromCGRect(textObservation.boundingBox));
                for (VNRectangleObservation *rectangleObservation in textObservation.characterBoxes) {
                    NSLog(@" |-%@", NSStringFromCGRect(rectangleObservation.boundingBox));
                }
            }
        }
    }];
    request.reportCharacterBoxes = YES;
    NSError *error;
    [handler performRequests:@[request] error:&error];
    if (error) {
        NSLog(@"%@", error);
    }
}

The thing is, the result is an array of bounding boxes for each detected character. From what I gathered from Vision's session, I think you are supposed to use CoreML to detect the actual chars.

Recommended WWDC 2017 talk: Vision Framework: Building on Core ML (haven't finished watching it either), have a look at 25:50 for a similar example called MNISTVision

Here's another nifty app demonstrating the use of Keras (Tensorflow) for the training of a MNIST model for handwriting recognition using CoreML: Github

tried SwiftOCR - not very impressed - had trouble with the sample image to string included in the app so would be worse on a image with text it hadnt been trained on. The singularity is on hold till next week! :) — brian.clear, Jun 15 '17 at 16:33
see comment above found Google Vision OCR - in beta. accessible over REST. Not in ios sdk yet. https://cloud.google.com/vision/docs/ocr — brian.clear, Jun 21 '17 at 12:38

score 2 · Answer 6 · answered Nov 13 '17 at 11:21

I'm using Google's Tesseract OCR engine to convert the images into actual strings. You'll have to add it to your Xcode project using cocoapods. Although Tesseract will perform OCR even if you simply feed the image containing texts to it, the way to make it perform better/faster is to use the detected text rectangles to feed pieces of the image that actually contain text, which is where Apple's Vision Framework comes in handy. Here's a link to the engine: Tesseract OCR And here's a link to the current stage of my project that has text detection + OCR already implemented: Out Loud - Camera to Speech Hope these can be of some use. Good luck!

score 2 · Answer 7 · answered Jun 29 '18 at 07:03

For those still looking for a solution I wrote a quick library to do this. It uses both the Vision API and Tesseract and can be used to achieve the task the question describes with one single method:

func sliceaAndOCR(image: UIImage, charWhitelist: String, charBlackList: String = "", completion: @escaping ((_: String, _: UIImage) -> Void))

This method will look for text in your image, return the string found and a slice of the original image showing where the text was found

score 2 · Answer 8 · answered Aug 07 '18 at 15:14

2

Firebase ML Kit does it for iOS (and Android) with their on-device Vision API and it outperforms Tesseract and SwiftOCR.

answered Aug 07 '18 at 15:14

Foti Dim

1,303
13
19

does This support IOS? if yes then please suggest reference link or sample, it fine for me to go ahead. Thanks in advance. – Dhaval Bhadania Aug 24 '18 at 11:26
Read my answer and you will find what you need. Remember to upvote. – Foti Dim Aug 24 '18 at 13:45

Converting a Vision VNTextObservation to a String

8 Answers8

Linked