
[screenshot: original scoreboard image]

This is the original screenshot. I cropped the image into 4 parts and cleaned up the background as much as I possibly could, but Tesseract only detects the last column here and ignores the rest.

[screenshot: cropped and cleaned player-name column, first team]

The output from Tesseract is shown as-is; there are blank spaces, which I remove while processing the result.

  Femme—Fatale.



  DaRkLoRdEIa
  aChineseN1gg4

  Noob_Diablo_

[screenshot: cropped and cleaned player-name column, second team]

The output from Tesseract is shown as-is; there are blank spaces, which I remove while processing the result.

Kicked.

NosNoel
ChikiZD
Death_Eag|e_42

Chai—.

[screenshot: cropped score columns, first team]

3579 10 1 7 148
2962 3 O 7 101
2214 2 2 7 99
2205 1 3 6 78

[screenshot: cropped score columns, second team]

8212
7198
6307
5640
4884

15
40
40
6O
80
80

I am just dumping the output of

result = pytesseract.image_to_string(Image.open("D:/newapproach/B&W"+str(i)+".jpg"), lang="New_Language")

But I do not know how to proceed from here to get a consistent result. Is there any way I can force Tesseract to recognize the text area and make it scan that? In the trainer (SunnyPage), Tesseract's default recognition scan fails to recognize some areas, but once I select them manually, everything is detected and translated to text correctly.
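For context, the closest I can get is to crop a region myself before OCR. A minimal sketch of that idea; the crop box coordinates are made-up placeholders, older Tesseract 3.x builds use -psm instead of --psm, and the digit whitelist is only honored by some versions:

    import pytesseract
    from PIL import Image

    img = Image.open("D:/newapproach/B&W1.jpg")

    # Hypothetical bounding box (left, upper, right, lower) for one column;
    # the real coordinates depend on the screenshot layout.
    column = img.crop((50, 100, 400, 600))

    # Treat the crop as a single uniform block of text and, for the
    # numeric columns, restrict recognition to digits.
    text = pytesseract.image_to_string(
        column,
        lang="New_Language",
        config="--psm 6 -c tessedit_char_whitelist=0123456789",
    )
    print(text)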


codefreaK
  • Can you share the original unprocessed image? Is the data in a table? – Amarpreet Singh Sep 19 '17 at 13:41
  • @AmarpreetSinghSaini added the original image, the cleaned and cropped images, and their respective outputs. I am just dumping the data in a text file for now; I plan to use a database later once the output is more accurate and reliable. – codefreaK Sep 19 '17 at 17:14
  • @Divaker Check the updated answer – Amarpreet Singh Sep 20 '17 at 06:17
  • You might try playing with the page segmentation method. There's a list of them here; one might be better suited for your problem than the default: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality – Saedeas Sep 21 '17 at 20:50
  • I checked out the page. Do you have any Python documentation of its implementation, or any idea where to specify the segmentation attributes? – codefreaK Sep 22 '17 at 17:34
  • I don't know if you're using PyTesser or not, but there's an example in the original question here in which the OP specifies the page segmentation method (the PSM options): https://stackoverflow.com/questions/16303398/tesseractnotfound-pytesser Hope this helps! – Saedeas Sep 25 '17 at 19:21

5 Answers


I tried with the command line, which gives us the option to decide which psm value should be used.

Can you try with this:

pytesseract.image_to_string(image, config='--psm 6')

I tried with the image provided by you, and below is the result:

[image: extracted text output]

The only problem I am facing is that my Tesseract dictionary is interpreting the "1" provided in your image as "I".

Below is the list of psm options available:

pagesegmode values are:

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
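If it is not obvious which mode suits a given crop, a quick loop can compare them. A minimal sketch, assuming a Tesseract build that accepts --psm (use -psm on 3.04) and a placeholder image path:

    import pytesseract
    from PIL import Image

    img = Image.open("crop.png")  # placeholder path

    # Run each practical segmentation mode and print the result,
    # so the best-suited mode can be picked by inspection.
    for psm in range(3, 11):
        print("--- psm {} ---".format(psm))
        print(pytesseract.image_to_string(img, config="--psm {}".format(psm)))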

Manoj
  • let me check that and get back to you – codefreaK Sep 26 '17 at 05:10
  • What I was looking for was how to pass this psm parameter. I find it quite funny, since I had simply never googled the image_to_string parameters; I tried everything but that. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I checked this some time back but never saw the documentation. – codefreaK Sep 26 '17 at 05:26
  • There is a similar answer here: https://stackoverflow.com/questions/44619077/pytesseract-ocr-multiple-config-options with one small (big) difference, one dash instead of two before the `psm`. I tried `--psm` and got nothing, then used `-psm` and the config option made a huge difference. I'm using python 2.7 and tesseract 3.04.01. – elPastor May 04 '19 at 16:42

I used this link: https://www.howtoforge.com/tutorial/tesseract-ocr-installation-and-usage-on-ubuntu-16-04/

Just use the commands below; they may increase accuracy by up to 50%:

sudo apt update
sudo apt install tesseract-ocr
sudo apt-get install tesseract-ocr-eng
sudo apt-get install tesseract-ocr-all
sudo apt install imagemagick
convert -h
tesseract [image_path] [file_name]
convert -resize 150% [input_file_path] [output_file_path]
convert [input_file_path] -type Grayscale [output_file_path]
tesseract [image_path] [file_name]

It will only show bold letters
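The same resize-and-grayscale preprocessing can also be done from Python with Pillow instead of ImageMagick; a rough equivalent sketch (file names are placeholders):

    import pytesseract
    from PIL import Image

    img = Image.open("input.png")

    # Equivalent of `convert -resize 150% ...`: enlarge by 1.5x.
    img = img.resize((int(img.size[0] * 1.5), int(img.size[1] * 1.5)))

    # Equivalent of `convert ... -type Grayscale ...`.
    img = img.convert("L")

    img.save("preprocessed.png")
    print(pytesseract.image_to_string(img))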

Thanks

Jacquot

My suggestion is to perform OCR on the full image.

I have preprocessed the image to get a grayscale image.

import cv2

# Load the screenshot and convert it to grayscale for OCR.
image_obj = cv2.imread('1D4bB.jpg')
gray = cv2.cvtColor(image_obj, cv2.COLOR_BGR2GRAY)
cv2.imwrite("gray.png", gray)

I ran tesseract on this image from the terminal, and the accuracy also seems to be over 90% in this case.

tesseract gray.png out

3579 10 1 7 148
3142 9 o 5 10
2962 3 o 7 101
2214 2 2 7 99
2205 1 3 6 78
Score Kills Assists Deaths Connection
8212 15 1 4 4o
7198 7 3 6 40
6307 6 1 5 60
5640 2 3 6 80
4884 1 1 5 so

Below are a few suggestions:

  1. Do not use the image_to_string method directly, as it converts the image to BMP and saves it at 72 dpi.
  2. If you want to use image_to_string, then override it to save the image at 300 dpi.
  3. You can use the run_tesseract method and then read the output file (a sketch follows below).
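A minimal sketch of suggestions 2 and 3: re-save the image with 300 dpi metadata, run the tesseract binary directly, and read its output file. This shells out with subprocess rather than using pytesseract's run_tesseract helper, and the file names are placeholders:

    import subprocess
    from PIL import Image

    # Suggestion 2: store 300 dpi metadata in the PNG.
    img = Image.open("gray.png")
    img.save("gray_300dpi.png", dpi=(300, 300))

    # Suggestion 3: run the tesseract CLI and read the output file
    # (tesseract appends ".txt" to the output base name).
    subprocess.check_call(["tesseract", "gray_300dpi.png", "out"])
    with open("out.txt") as f:
        print(f.read())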

Image on which I ran OCR: [grayscale screenshot of the full scoreboard]

Another approach for this problem could be to crop the digits and feed them to a neural network for prediction.

Amarpreet Singh
  • Can you add the image on which you ran the tesseract scan? And regarding 1, 2, 3: I read the image, converted it into grayscale, and saved it at 300+ dpi before doing the scan. So are you telling me that if I pass a 300 dpi image to image_to_string, it gets converted to 72 dpi? – codefreaK Sep 20 '17 at 10:11
  • No, the image_to_string function doesn't take an argument for a custom dpi. You need to override it. – Amarpreet Singh Sep 20 '17 at 11:16
  • I may get text from the grayscale approach; I started out there, then ventured deeper and tried out many things. For this to work for me, I had to clear the background and split the image into 4 cropped pics of team names and the rest of the score values. What I have right now is not the issue you just described; I am trying to find a solution to something else, which is why the purely clean image is not recognized. See, I used SunnyPage to recognize and train the new language. While using that, if you try recognizing, only 2 columns are found, even if the image is at 300 dpi or 1000 dpi. – codefreaK Sep 21 '17 at 03:20
  • But if I manually select the area for recognizing, it gives me perfect output. The issue at hand is that tesseract is actually skipping over random areas while processing the image, even though it is at 300 dpi with a clear background and only text remaining. I started out with what you did, extracting the data from grayscale, but it is too inaccurate to work consistently. You need to do what I did; regarding player names I get 99.9% accuracy most of the time, but I am facing a problem with the team scores, as it randomly skips detecting the columns. – codefreaK Sep 21 '17 at 03:21
  • Whoever downvotes the answer, please mention the reason for it. – Amarpreet Singh Sep 22 '17 at 15:28

I think that you have to preprocess the image first. The changes that work for me are below. Supposing:

from PIL import Image
img = Image.open("yourimg.png")
  • Make the image bigger; I usually double the image size. Note that resize takes a tuple and returns a new image:

    img = img.resize((img.size[0]*2, img.size[1]*2))

  • Grayscale the image:

    img = img.convert('LA')

  • Make the characters bolder. You can see one approach here: https://blog.c22.cc/2010/10/12/python-ocr-or-how-to-break-captchas/ but that approach is fairly slow; if you use it, I would suggest another approach (a faster sketch follows after this list).

  • Select, invert the selection, and fill with black and white using gimpfu:

    image = pdb.gimp_file_load(file, file)
    layer = pdb.gimp_image_get_active_layer(image)
    REPLACE = 2
    pdb.gimp_by_color_select(layer, "#000000", 20, REPLACE, 0, 0, 0, 0)
    pdb.gimp_context_set_foreground((0, 0, 0))
    pdb.gimp_edit_fill(layer, 0)
    pdb.gimp_context_set_foreground((255, 255, 255))
    pdb.gimp_edit_fill(layer, 0)
    pdb.gimp_selection_invert(image)
    pdb.gimp_context_set_foreground((0, 0, 0))
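As the faster alternative mentioned above for making the characters bolder, a minimal sketch using Pillow's MinFilter (this assumes dark text on a light background, since MinFilter expands dark pixels):

    from PIL import Image, ImageFilter

    img = Image.open("yourimg.png").convert("L")

    # Replace each pixel with the darkest pixel in its 3x3 neighbourhood,
    # which thickens thin dark strokes.
    bold = img.filter(ImageFilter.MinFilter(3))
    bold.save("bold.png")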

Melardev
  • Check the code: I made the image 3 times the size before I tried cleaning it. Please check the code before replying, and see the screenshot and sample image. I cleared everything and made it black and white; I have done everything you just typed in the answer. The problem I am facing right now is that when a clear-background image like https://i.stack.imgur.com/tytSQ.jpg is fed as input to tesseract, it fails to detect anything other than 2 columns, whereas in the case of this image https://i.stack.imgur.com/tytSQ.jpg it identifies it correctly. My question is specifically about the reason for this weird behaviour. – codefreaK Sep 25 '17 at 09:33
  • If anyone can give me an explanation for it and possible fixes for it. – codefreaK Sep 25 '17 at 09:38
Bilateral filtering plus Otsu thresholding before OCR worked for me:

import cv2
import pytesseract

fn = 'image.png'
img = cv2.imread(fn, 0)                     # load as grayscale
img = cv2.bilateralFilter(img, 20, 25, 25)  # smooth noise while preserving edges
ret, th = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu binarization
# Image.fromarray(th)
print(pytesseract.image_to_string(th, lang='eng'))
0x01h
  • Please put your answer always in context instead of just pasting code. See [here](https://stackoverflow.com/help/how-to-answer) for more details. – gehbiszumeis Jan 06 '20 at 10:15