
I have an image of text, where the words are outlined rather than filled in. Tesseract is struggling to get any of the words correct - does anyone have a solution to these types of problems?

I have tried simple operations like inversion, but to no effect. I'm guessing Tesseract already handles this.

Example images (the outlined words "Next" and "Previous"): [image] [image]
Typical output for Next: New
Typical output for Previous: Pflevuows

My (very simple) code, which takes the image path as an argument:

import pytesseract
import sys
from PIL import Image

print(pytesseract.image_to_string(Image.open(sys.argv[1])))
print(sys.argv[1])

EDIT: Applying a binary threshold gets "Next" recognized, but "Previous" still fails.

Alter
  • You could try OpenCV for OCR or segmentation or preprocessing (filling outlined text, or filling background and inverting the image). – handle Jun 23 '16 at 20:31
  • I tried using floodfill, but the space between the E and X wasn't caught. When I tried to invert it, I didn't get any text back – Alter Jun 23 '16 at 20:38
  • It looks like what I want is called skeletonization. I've started reading up on OpenCV to see if it can help. Somebody, save me :( – Alter Jun 24 '16 at 20:23
  • Well I have installed OpenCV and the Python bindings, but don't count on it. Is your problem limited to this exact font? Does tesseract decode it properly when you fill the outlines manually? – handle Jun 24 '16 at 20:33
  • No. I found a [pdf doc](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/3CharacterClassifiers.pdf) in tesseract-ocr's git that says current methods of skeletonization are unreliable. When 'current' was, I do not know, but there may be a solution out there (I just haven't found it yet). – Alter Jun 24 '16 at 23:13

1 Answer


This is probably too late for you, but it'll help anyone who sees this. I had this same problem and I fixed it. (The solution uses OpenCV.)

First, apply a binary threshold. With the right threshold value, your letters shouldn't touch each other, and this should work well. The point is to make every pixel pure black or white, so that the flood fill succeeds instead of getting stuck on faded gray pixels (which seems to be what happened when you tried it before).

After this, flood-fill the background with black. Since your letters don't touch the image borders, a fill started at the border should cover everything outside the letters, although when I was doing it, I had to call floodfill on every outermost pixel in the image.

Lastly, flip the image colors. This can be done with cv2.bitwise_not(). Now it should be ready for OCR.

raghav m
  • A little bit late ;) but as you say hopefully helpful to others. I've used floodfill for a number of problems since then. – Alter Aug 11 '21 at 19:11