Tesserocr did not recognize text

Question

I would like suggestions on how to deal with tesserocr failing to recognize one particular line of an image.

This is the image. Its source is Simple Digit Recognition OCR in OpenCV-Python.

The code

from PIL import Image
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('test3.png')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    # Each entry is (cropped image, bounding box dict, block id, paragraph id)
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)
    print('Found {} textline image components.'.format(len(boxes)))
    for i, (im, box, _, _) in enumerate(boxes):
        # Restrict recognition to the current text line
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        result = (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                  "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box)
        print(result)  # inside the loop, so every box is reported

The result is like this

Found 5 textline image components.
Box[0]: x=10, y=5, w=582, h=29, confidence: 81, text: 9821480865132823066470938


Box[1]: x=9, y=55, w=581, h=30, confidence: 91, text: 4460955058223172535940812


Box[2]: x=10, y=106, w=575, h=30, confidence: 90, text: 8481117450284102701938521


Box[3]: x=12, y=157, w=580, h=30, confidence: 0, text:
Box[4]: x=11, y=208, w=581, h=30, confidence: 89, text: 6442881097566593344612847

It did not recognize the number in box 3. What should I add to or change in the script so that box 3 shows the proper result?

Thank you for your help.

thewaywewere · Accepted Answer · 2017-03-30T06:58:50.220

It's correctly recognized with Tesseract 4.00.00alpha with default psm 3 and oem 3 modes. Below is the result.

I suggest upgrading Tesseract to v4.0, together with your tesserocr, if you are still using v3.x.

EDIT:

To upgrade tesserocr to support v4.00.00alpha, check the "Is any plan to porting tesseract 4.0 (alpha)" issue page. It has guidelines for making it work.
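Before and after upgrading, it may help to check which Tesseract library tesserocr is actually linked against, since reinstalling the wrapper with pip does not by itself replace the underlying library. Below is a minimal sketch, assuming the version text begins with `tesseract X.YY`. The helper names `parse_tesseract_version` and `linked_tesseract_version` are made up for illustration and are not part of tesserocr; only `tesserocr.tesseract_version()` is the library's own call.

```python
import re


def parse_tesseract_version(version_text):
    """Return (major, minor) parsed from text such as 'tesseract 3.05.00'."""
    match = re.search(r"tesseract\s+(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_text)
    return int(match.group(1)), int(match.group(2))


def linked_tesseract_version():
    """Ask tesserocr which Tesseract library it was built against."""
    import tesserocr  # imported lazily so the parser above works without it
    return parse_tesseract_version(tesserocr.tesseract_version())
```

Calling `linked_tesseract_version()` inside the virtual environment should show whether the major version is still 3 after reinstalling tesserocr. If it is, the Tesseract library itself probably needs upgrading first.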

edited Mar 30 '17 at 06:58

answered Mar 29 '17 at 16:01


Thank you for your suggestion, but can you elaborate on how to upgrade `tesseract` to `v4.0` with `tesserocr`? I installed `tesserocr` version `2.1.3` with `pip install tesserocr` in a Python virtual environment. How should I proceed? – Fang Mar 30 '17 at 02:13
@Fang see the EDIT in the answer about the upgrade. If my reply helps and you have no more questions, you can tick the answer to close the question. – thewaywewere Mar 30 '17 at 05:13
Sorry for the late reply. I followed the instructions from the linked issue page: I ran `git clone` on the repo, `cd` into it, `git checkout tesseract4` and `pip install .` However, it still did not fix my problem when I run the script above. Do I need to add other dependencies or modify the script further? – Fang Mar 30 '17 at 14:34
Have you verified that you have tesseract `4.00.00alpha`? What is returned when you type 1) `import tesserocr` 2) `from PIL import Image` 3) `print tesserocr.tesseract_version()`? – thewaywewere Mar 31 '17 at 05:29
It's version 3.05. So to get tesseract 4.00alpha I cannot just do `brew install tesseract`, am I right? I am on OS X El Capitan, by the way. I did try to search for how to do that but failed. Sorry to ask more, but do you happen to know a way to upgrade it on OS X El Capitan? Thank you. – Fang Apr 01 '17 at 14:50
@Fang Upgrading to `tesseract 4.00.00alpha` and `tesserocr 2.2.0-beta` is problematic. You may check my alternative in the 2nd answer. – thewaywewere Apr 02 '17 at 12:03

thewaywewere · Answer 2 · 2017-04-02T13:13:23.513

I came up with the code below, which gives the correct OCR result but without the x, y, w, h and confidence info.

import tesserocr
from PIL import Image

print(tesserocr.tesseract_version())  # print tesseract-ocr version

image = Image.open('SO_5TextLines.png')

text = tesserocr.image_to_text(image)  # OCR the whole image in one call
for line in text.splitlines():  # the text is newline-separated, not "\r"-separated
    print(line)

Output:

tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.27 : libtiff 4.0.6 : zlib 1.2.8 : libopenjp2 2.1.2

9821480865132823066470938
4460955058223172535940812
8481117450284102701938521
1055596446229489549303819
6442881097566593344612847
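If the box and confidence info is still needed, one option that might work is to recognize the whole image once and then walk the result line by line with tesserocr's result iterator, instead of calling `SetRectangle()` for each box. This is only a sketch and has not been tested on this image. `bbox_to_xywh` and `ocr_lines_with_boxes` are hypothetical helper names, and the sketch assumes `BoundingBox()` returns `(left, top, right, bottom)`.

```python
def bbox_to_xywh(bbox):
    """Convert (left, top, right, bottom) into the x/y/w/h dict used in the question."""
    left, top, right, bottom = bbox
    return {'x': left, 'y': top, 'w': right - left, 'h': bottom - top}


def ocr_lines_with_boxes(path):
    """Recognize the whole image once, then collect (box, confidence, text) per line."""
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    results = []
    with PyTessBaseAPI() as api:
        api.SetImage(Image.open(path))
        api.Recognize()  # full-page recognition, no SetRectangle involved
        level = RIL.TEXTLINE
        for r in iterate_level(api.GetIterator(), level):
            bbox = r.BoundingBox(level)
            if bbox is None:  # no box available at this level, skip it
                continue
            text = r.GetUTF8Text(level).strip()
            results.append((bbox_to_xywh(bbox), r.Confidence(level), text))
    return results
```

Each returned tuple can be formatted the same way as the `Box[i]: ...` lines in the question, for example by looping over `ocr_lines_with_boxes('SO_5TextLines.png')`.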

I ran your code on OS X Sierra and got the same result, with line 4 missing. The problem appears to be caused by api.SetRectangle(); you could modify your code to print the boxes to check further. The sample code above is based only on the sample text image you provided, so it needs to be tested with more images to verify that it works in general.
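To check the boxes, and to try one possible workaround, the rectangle can be printed and then enlarged by a few pixels before recognition. Whether padding actually fixes box 3 is an assumption that has not been verified here. `pad_box` and `ocr_padded_lines` are hypothetical helper names and the default padding of 5 pixels is arbitrary.

```python
def pad_box(box, pad, image_width, image_height):
    """Grow a tesserocr box dict by `pad` pixels per side, clamped to the image."""
    left = max(box['x'] - pad, 0)
    top = max(box['y'] - pad, 0)
    right = min(box['x'] + box['w'] + pad, image_width)
    bottom = min(box['y'] + box['h'] + pad, image_height)
    return {'x': left, 'y': top, 'w': right - left, 'h': bottom - top}


def ocr_padded_lines(path, pad=5):
    """OCR each detected text line through a padded rectangle."""
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    image = Image.open(path)
    width, height = image.size
    results = []
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        for _, box, _, _ in api.GetComponentImages(RIL.TEXTLINE, True):
            print(box)  # inspect the raw box reported by Tesseract
            padded = pad_box(box, pad, width, height)
            api.SetRectangle(padded['x'], padded['y'], padded['w'], padded['h'])
            text = api.GetUTF8Text().strip()
            conf = api.MeanTextConf()
            results.append((padded, conf, text))
    return results
```

If box 3 still comes back empty with padding, comparing its printed coordinates with the neighbouring boxes may show whether the box itself is wrong or the recognition inside it fails.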

Hope this works for you.

Thank you for your help, and sorry for the late reply. Your second answer does work for the sample text image I provided and for some other text images with clear numbers or letters. I also tried it with a screenshot taken from my screen and the result was not good: although it printed all the lines, some of the words and symbols were unintelligible. Thank you for your help. – Fang May 03 '17 at 03:44
