
I am trying to read text with tesseract.js in Node.

To speed up the process, I'd like it to print the results line by line. If I understand Tesseract correctly, it first reads the image, applies some preprocessing (filters and so on) to make it more readable, detects the text lines, and only then starts recognizing the text in those lines.

Is there a way to get those lines before it finishes recognizing the whole document, or would I need to split the image myself and pass each line to Tesseract as a separate image?

To try this, I split a test image into lines by hand and ran each piece through Tesseract. For my specific use case, this speeds up the process by a lot.

Result: https://gyazo.com/3e50e40bb7b2f07190c4a2377b65c92d

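var Tesseract = require('tesseract.js')

// One pre-cropped image per text line, in reading order.
var myImage = [
    "1.jpg",
    "2.jpg",
    "3.jpg",
    "4.jpg",
    "5.jpg",
    "6.jpg",
    "7.jpg",
    "8.jpg",
    "9.jpg"
]

rec(0)

// Recognize the images one after another and print each line as soon as it is ready.
function rec(number) {
    var thisImage = myImage[number];

    Tesseract.recognize(thisImage, {
        lang: "eng"
    })
        .then(function (result) {
            // Strip line breaks so each result prints on a single line.
            console.log("[Line: " + number + "] " + result.text.replace(/(\r\n|\n|\r)/gm, ""));
            // Continue until every image in the list has been processed.
            if (number < myImage.length - 1) {
                rec(number + 1)
            }
        })
        .catch(function (err) {
            console.error("[Line: " + number + "] recognition failed:", err);
        })
}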

If I do have to split the image beforehand, which Node.js library should I use to detect the lines? ImageMagick?

The only thing I found is a Python example from somebody trying to detect lines in an image: "Split text lines in scanned document".
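As far as I understand it, one common idea for this is a horizontal projection profile: count the dark pixels in each row and treat every run of rows that contain ink as one text line. Here is a minimal sketch of that idea in plain JavaScript, assuming the image has already been decoded into a flat grayscale array. The names findLineBands, darkThreshold and minInkPixels are made up for illustration, and decoding and cropping the image are not shown.

```javascript
// Find text-line bands in a grayscale image using a horizontal projection profile.
// pixels: flat, row-major array of grayscale values (0 = black, 255 = white).
// All names here (findLineBands, darkThreshold, minInkPixels) are hypothetical.
function findLineBands(pixels, width, height, darkThreshold = 128, minInkPixels = 1) {
  const bands = [];
  let start = -1; // first row of the current band, or -1 when between lines

  for (let y = 0; y < height; y++) {
    // Count dark ("ink") pixels in this row.
    let ink = 0;
    for (let x = 0; x < width; x++) {
      if (pixels[y * width + x] < darkThreshold) ink++;
    }

    const hasText = ink >= minInkPixels;
    if (hasText && start === -1) {
      start = y; // a band begins
    } else if (!hasText && start !== -1) {
      bands.push({ top: start, bottom: y - 1 }); // a band ends
      start = -1;
    }
  }

  // Close a band that runs to the bottom edge of the image.
  if (start !== -1) bands.push({ top: start, bottom: height - 1 });
  return bands;
}

// Tiny synthetic 4x6 image: rows 1-2 and row 4 contain dark pixels.
const W = 4, H = 6;
const demo = [
  255, 255, 255, 255,
  255,   0,   0, 255,
    0,   0, 255, 255,
  255, 255, 255, 255,
  255,   0, 255,   0,
  255, 255, 255, 255,
];
console.log(findLineBands(demo, W, H));
// prints [ { top: 1, bottom: 2 }, { top: 4, bottom: 4 } ]
```

Each band would then have to be cropped out of the original image and passed to Tesseract on its own. A real scan would probably also need binarization and deskewing first, and a larger minInkPixels to ignore noise.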

What would be the right approach for this?

Thomas Weiss