20

I am trying to read the text on these price labels, which is always clearly preprocessed. Although Tesseract can easily read the text written above the price, it fails to detect the price values. I am using the Python bindings (pytesseract), though it also fails when run from the CLI. Most of the time it recognizes the price part as only one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output

And the output of the sample image is this.

je Beutel

13

However, if I crop and stretch the price so the digits look separated and are the same font size, the output is just fine.

Processed image (cropped and shrunk price):

je Beutel

1,89

How do I get Tesseract to work as I intended, since I will be going over a lot of similar images?

Edit: Added more price tags:

sample2, sample3, sample4, sample5, sample6, sample7

NONONONONO
  • 3
    Try come up with an algorithm which uses e.g. the `cv2.connectedComponents` and `cv2.boundingRect` functions to detect connected regions which are of dissimilar size on the same horizontal region. You can then call `tesseract` after either enlarging the smaller regions, shrinking the larger regions, or isolate the dissimilar regions and make the call separately. – dROOOze Mar 28 '18 at 13:44
  • Can you write down an example of how it might work? Perhaps I can feed the components one by one and it would still work, but `cv2.connectedComponents` returns a black image – NONONONONO Mar 28 '18 at 16:12
  • See https://stackoverflow.com/questions/43547540/cv2-connectedcomponents-not-detecting-components – dROOOze Mar 28 '18 at 16:18
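
For illustration, here is a minimal sketch of the idea from the comments above, using cv2.connectedComponentsWithStats (which returns a bounding box per component, so a separate cv2.boundingRect call is not needed). The file name, the 0.6 height threshold, and the --psm 10 per-character call are assumptions, not part of the original question:

import cv2
import pytesseract

img = cv2.imread('label.png', cv2.IMREAD_GRAYSCALE)   # hypothetical file name
# binarize with text as white on black for the connected-component analysis
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# each non-background label gets a row in stats: LEFT, TOP, WIDTH, HEIGHT, AREA
# (the raw label image looks black only because the label values are tiny integers)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
boxes = [tuple(stats[i]) for i in range(1, n)]          # skip label 0 (background)
max_h = max(h for _, _, _, h, _ in boxes)

texts = []
for x, y, w, h, area in boxes:
    roi = img[y:y + h, x:x + w]
    if h < 0.6 * max_h:                                  # dissimilar size: enlarge it
        scale = max_h / float(h)
        roi = cv2.resize(roi, None, fx=scale, fy=scale)
    # --psm 10 treats the ROI as a single character (assumes one glyph per component)
    texts.append((x, pytesseract.image_to_string(roi, config='--psm 10').strip()))

# stitch the per-component results back together, left to right
print(''.join(t for _, t in sorted(texts, key=lambda p: p[0])))

Instead of enlarging the small components in place, the same bounding boxes could be used to isolate the cents digits and pass them to Tesseract in a separate call, as the comment suggests.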

2 Answers

15

The problem is that the image you are using is small. When Tesseract processes it, it treats '8', '9' and ',' as a single letter and therefore predicts '3', or it may treat '8' and ',' as one letter and '9' as a different letter, and so produces the wrong output. The image shown below illustrates this.

detected contours of the original (small) image

A simple solution could be to increase its size by a factor of 2 or 3, or even more depending on the size of your original image, and then pass it to Tesseract so that it detects each letter individually, as shown below. (Here I increased the size by a factor of 2.)

detected contours of the resized (larger) image

Below is a simple Python script that should serve your purpose:

import pytesseract
import cv2

# read the label image and enlarge it by a factor of 2 in both axes
img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

# collapse blank lines, then split the OCR output into its individual lines
data = data.replace('\n\n', '\n')
data = data.split('\n')

# with the detected text above, data == ['je Beutel', '89', '1.']
dollars = data[2].strip(',').strip('.')   # integer part, stray '.' or ',' removed
cents = data[1]

print('{}.{}'.format(dollars, cents))

Desired Format:

1.89
skt7
  • The questioner has clearly mentioned that he/she is trying to detect price label text which is always clearly preprocessed in the shown format. – skt7 Apr 20 '18 at 00:03
  • I am updating the question with more test cases, and for almost all of them this does not work. In your answer, 89 being recognized in front of 1 also shows something is wrong (they should have been on the same line, 1 is not below 89, and the comma is recognized as a dot). I am really focusing on the part where there are digits on top of the comma. – NONONONONO Apr 20 '18 at 04:19
  • This is how Tesseract works: it recognizes characters and prints text based on the positions at which it recognized them. You will either have to work with this or train your own model that works exactly as you need, which I think is preferable in your scenario since you need to process images with the same formatting. – skt7 Apr 20 '18 at 04:36
  • @NONONONONO can you upload the images to a GitHub repo and share the link so I can understand your dataset more clearly and suggest something accordingly? – skt7 Apr 20 '18 at 04:39
  • I really cannot, as they are something I should not be sharing, but I added a few test cases anyhow. I am not sure what you meant by "position", because as you can see, despite 89 being on the same line and to the right of the 1, it failed to be recognized as 1,89 (just like normal reading). Also, image size is evidently not the problem, as the letters above the price digits (for all the images I have) are recognized correctly. I moved to a completely new architecture for recognizing the price digits. – NONONONONO Apr 20 '18 at 05:08
  • https://gist.github.com/skt7/f98042c6c9c8bd81095fedadd322094e Use this code to analyze all your images; you can then come up with a way to parse the different types of text returned by tesseract. You need to try different resizeFactor values, as they change the output. – skt7 Apr 20 '18 at 17:07
  • I have an image with characters that are not horizontally aligned and are of different font sizes. I have tried your approach but no luck :( – Anuj Teotia Apr 24 '18 at 05:09
  • This code specifically works for the particular case shown. Can you provide the image? – skt7 Apr 25 '18 at 06:42
  • I am sorry, but your approach is wrong and the answer is misleading; it is not about the image being small, since even smaller letters above the price are recognized correctly. My theory is that you stretch it along the X axis more than along the Y axis (because of the rectangular proportions), so the characters stacked on top of the comma are separated a bit more and recognized individually, yet it is still read wrong (i.e. 89\n1.). – NONONONONO Apr 25 '18 at 10:20
  • I answered that; you need to find some simple hacks to crack this, and this is what I came up with, and I even said you can play with the code to make your own. I also clearly mentioned that you can train your own model, but that would need some serious work. This hack was just to give you an idea of how you can achieve different things using simple manipulations of the image. – skt7 Apr 25 '18 at 19:34
6

The problem is that the Tesseract engine was not trained to read this kind of text topology.

You can:

  • train your own model; in particular, you'll need to provide images with variations of the topology (positions of the characters). You can actually use the same image and shuffle the positions of the characters.
  • reorganize the image into clusters of text and then use Tesseract; in particular, I would take the cents part and move it to the right of the comma, in which case you can use Tesseract out of the box (a rough sketch follows this list). A few relevant criteria would be the height of the clusters (to differentiate the cents from the integer part) and the position of the clusters (read from left to right).
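
A minimal sketch of that second option, assuming the two cluster positions are already known; the file name, the coordinates, and the --psm 7 setting below are illustrative assumptions, not taken from the question:

import cv2
import numpy as np
import pytesseract

# hypothetical input; in practice the boxes would come from connected-component analysis
img = cv2.imread('label.png', cv2.IMREAD_GRAYSCALE)

integer_box = (10, 40, 30, 60)   # (x, y, w, h) of the "1," cluster (assumed)
cents_box = (45, 20, 55, 40)     # (x, y, w, h) of the "89" cluster (assumed)

def crop(image, box):
    x, y, w, h = box
    return image[y:y + h, x:x + w]

integer_part = crop(img, integer_box)
cents_part = crop(img, cents_box)

# scale the cents cluster so both clusters share the same height
h = integer_part.shape[0]
scale = h / cents_part.shape[0]
cents_part = cv2.resize(cents_part, (int(round(cents_part.shape[1] * scale)), h))

# paste the clusters side by side on a white canvas, integer part first
canvas = np.full((h, integer_part.shape[1] + cents_part.shape[1] + 10), 255, dtype=np.uint8)
canvas[:, :integer_part.shape[1]] = integer_part
canvas[:, -cents_part.shape[1]:] = cents_part

# the price now reads as an ordinary single line of text (--psm 7)
print(pytesseract.image_to_string(canvas, config='--psm 7').strip())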

In general, computer vision algorithms (including CNNs) give you tools to build a higher-level representation of an image (features or descriptors), but they do not give you the logic or algorithm to process those intermediate results in a particular way.

In your case that would be:

  • "if the height of those letters are smaller, it's cents",
  • "if the height, and vertical position is the same, it's about the same number, either on left of coma, or on the right of coma".

The thing is that it's difficult to reach that through training, yet it's extremely simple for a human to write it down as an algorithm. Sorry for not giving you an actual implementation, but the text above is the pseudo code.
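
As a rough illustration only, that pseudo code might be written out like this (the (x, y, w, h) box format and the 0.7 height ratio are assumptions):

def classify_boxes(boxes):
    """Split character bounding boxes (x, y, w, h) into integer digits and cents."""
    max_h = max(h for _, _, _, h in boxes)
    integer_digits, cent_digits = [], []
    for box in sorted(boxes):                 # read from left to right
        x, y, w, h = box
        if h < 0.7 * max_h:                   # "if the height is smaller, it's the cents"
            cent_digits.append(box)
        else:                                 # same height and position: same number
            integer_digits.append(box)
    return integer_digits, cent_digits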

TrainingTesseract2

TrainingTesseract4

Joint Unsupervised Learning of Deep Representations and Image Clusters

Soleil