
I'm trying to read text from an image, using OpenCV and Pytesseract, but with poor results.

The image I'm trying to read the text from is: https://www.lubecreostorepratolapeligna.it/gb/img/logo.png

This is the code I am using:

 import cv2
 import pytesseract

 pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\pytesseract\tesseract.exe'
 path_to_image = "logo.png"  # local copy of the linked logo
 image = cv2.imread(path_to_image)
 # convert the image to grayscale
 gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
 cv2.imshow('grey image', gray_image)
 cv2.waitKey(0)
 # binarize by thresholding (Otsu); this step is required for a colored image,
 # otherwise tesseract won't be able to detect the text correctly
 threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
 # display image
 cv2.imshow('threshold image', threshold_img)
 # Maintain output window until user presses a key
 cv2.waitKey(0)
 # Destroying present windows on screen
 cv2.destroyAllWindows()
 # now feeding image to tesseract
 text = pytesseract.image_to_string(threshold_img)
 print(text)

The result of the execution is: ["cu"," ","LUBE"," ","STORE","PRATOLA PELIGNA"]

But the result should be these 7 words: ["cucine", "LUBE", "CREO", "kitchens", "STORE", "PRATOLA", "PELIGNA"]

Can anyone help me solve this problem?

Matteo

1 Answer


Edit, 17.12.2020: With preprocessing it now recognizes everything except the "O" in CREO. See the stages in ocr8.py. Then ocr9.py demonstrates (not automated yet) finding the lines of text from the coordinates returned by pytesseract.image_to_boxes(), estimating the approximate letter size and inter-symbol distance, then extrapolating one step ahead and searching for a single character (--psm 8), as sketched below.
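
A minimal sketch of that extrapolation idea (not the actual ocr9.py; the box values are the C, R, E rows from the image_to_boxes() output listed further below, and the code assumes the same preprocessed image they were measured on):

    import cv2
    import pytesseract

    image = cv2.imread("logo_preprocessed.png")  # hypothetical: the image the boxes came from
    img_h = image.shape[0]

    # (x1, y1, x2, y2) boxes for C, R, E on the CREO line,
    # in Tesseract coordinates (origin at the bottom-left corner)
    boxes = [(21, 30, 55, 65), (62, 31, 93, 64), (99, 31, 127, 64)]

    avg_w = sum(x2 - x1 for x1, _, x2, _ in boxes) // len(boxes)
    avg_gap = sum(boxes[i + 1][0] - boxes[i][2]
                  for i in range(len(boxes) - 1)) // (len(boxes) - 1)

    # extrapolate one letter cell to the right of "E"
    x1 = boxes[-1][2] + avg_gap
    x2 = x1 + avg_w
    y1, y2 = boxes[-1][1], boxes[-1][3]

    # flip y for OpenCV's top-left origin, then search for a single character
    crop = image[img_h - y2:img_h - y1, x1:x2]
    print(pytesseract.image_to_string(crop, config="--psm 8").strip())  # ideally "O"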

It turned out that Tesseract had actually recognized the "O" in CREO, but it read it as "♀", probably confused by the little "k" below it etc.

Since that is a rare and "strange"/unexpected symbol, it can be corrected - replaced automatically (see the function Correct()).

There is a technical detail: Tesseract returns the ANSI/ASCII symbol 12 (0x0C), while my editor displayed it as the Unicode/UTF-8 code point 9792 ("♀"), so I encoded it in the source as chr(12).
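
A minimal guess at what such a Correct() could look like (only the name and the chr(12) → "O" mapping come from the description above; the body is an assumption, the real one is in the linked repo):

    def Correct(text):
        # chr(12) (form feed) cannot legitimately occur in the OCR output,
        # so map it back to the "O" that Tesseract misread
        return text.replace(chr(12), "O")

    print(Correct("CRE" + chr(12)))  # -> CREO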

The latest version: ocr9.py

Original image:

[image: the original logo]

[image: the preprocessed stages]

You mentioned that PRATOLA and PELIGNA have to be returned separately - just split by " ":

 splitted = text.split(" ")

RECOGNIZED

CUCINE

LUBE

STORE

PRATOLA PELIGNA

CRE [+O with correction and extrapolation of the line]

KITCHENS

...
C 39 211 47 221 0
U 62 211 69 221 0
C 84 211 92 221 0
I 107 211 108 221 0
N 123 211 131 221 0
E 146 211 153 221 0
L 39 108 59 166 0
U 63 107 93 166 0
B 98 108 128 166 0
E 133 108 152 166 0
S 440 134 468 173 0
T 470 135 499 173 0
O 500 134 539 174 0
R 544 135 575 173 0
E 580 135 608 173 0
P 287 76 315 114 0
R 319 76 350 114 0
A 352 76 390 114 0
T 387 76 417 114 0
O 417 75 456 115 0
L 461 76 487 114 0
A 489 76 526 114 0
P 543 76 572 114 0
E 576 76 604 114 0
L 609 76 634 114 0
I 639 76 643 114 0
G 649 75 683 115 0
N 690 76 722 114 0
A 726 76 764 114 0
C 21 30 55 65 0
R 62 31 93 64 0
E 99 31 127 64 0
K 47 19 52 25 0
I 61 19 62 25 0
T 71 19 76 25 0
C 84 19 89 25 0
H 96 19 109 25 0
E 113 19 117 25 0
N 127 19 132 25 0
S 141 19 145 22 0

These character boxes come from pytesseract.image_to_boxes(); each row is char x1 y1 x2 y2 page, with the origin at the image's bottom-left corner.
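
A minimal sketch of obtaining and visualizing such boxes (the file name is an assumption):

    import cv2
    import pytesseract

    image = cv2.imread("logo.png")  # assumed local copy of the logo
    h = image.shape[0]

    # each row of image_to_boxes(): "char x1 y1 x2 y2 page"
    for row in pytesseract.image_to_boxes(image).splitlines():
        ch, x1, y1, x2, y2, _page = row.split(" ")
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        # flip y to OpenCV's top-left origin before drawing
        cv2.rectangle(image, (x1, h - y1), (x2, h - y2), (0, 255, 0), 1)

    cv2.imshow("boxes", image)
    cv2.waitKey(0)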

Initial message:

I guess that for the area with "cucine", an adaptive threshold may segment it better, or maybe apply some edge detection first.

"kitchens" seems very small - what about enlarging that area?

As for CREO, I guess Tesseract is confused by the big and small sizes of the adjacent captions. For the "O" in CREO, you may apply dilation in order to close the gap in the "O".

Edit: I played a bit, but without Tesseract, and it needs more work. My goal was to make the letters more contrasting. Some of these processing steps may need to be applied selectively, only on the "cucine" part, maybe running the recognition in two passes: when getting partial words like "Cu", apply the adaptive threshold etc. (below) and OCR a top rectangle around "CU..." (a sketch of this two-pass idea follows).
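
A minimal sketch of that two-pass idea (the crop coordinates are placeholders, not measured values):

    import cv2
    import pytesseract

    image = cv2.imread("logo.png")  # assumed local copy of the logo

    # second pass: cut a rectangle around the area where the partial "Cu"
    # was found (hypothetical coordinates) and preprocess it separately
    top_left = image[0:80, 0:220]
    gray = cv2.cvtColor(top_left, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 13, 3)
    print(pytesseract.image_to_string(binary))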

Binary threshold: [image]

Adaptive threshold, median blur (to clean noise) and invert: [image]

Dilation connects small gaps, but it also destroys detail.
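
A minimal sketch of that trade-off (the kernel size is a guess; the input is assumed to be an already binarized image in a hypothetical file):

    import cv2
    import numpy as np

    binary = cv2.imread("logo_threshold.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file

    # a small kernel closes the broken stroke of the "O";
    # a bigger one closes bigger gaps but also merges nearby detail
    kernel = np.ones((2, 2), np.uint8)
    dilated = cv2.dilate(binary, kernel, iterations=1)

    cv2.imshow("dilated", dilated)
    cv2.waitKey(0)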

import cv2
import numpy as np
import pytesseract
#pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\pytesseract\tesseract.exe'
path_to_image = "logo.png"
#path_to_image = "logo1.png"
image = cv2.imread(path_to_image)
h, w, _ = image.shape
w *= 3; h *= 3  # enlarge 3x so the small letters get more pixels
image = cv2.resize(image, (w, h), interpolation=cv2.INTER_AREA)
# convert the image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('grey image', gray_image)
cv2.waitKey(0)
# binarize by thresholding
# this step is required for a colored image; if you skip it,
# tesseract won't be able to detect the text correctly
#threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
threshold_img = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 13, 3)
cv2.imshow('threshold image', threshold_img)            
cv2.waitKey(0)
#threshold_img = cv2.GaussianBlur(threshold_img,(3,3),0)
threshold_img = cv2.medianBlur(threshold_img,5)
cv2.imshow('medianBlur', threshold_img)            
cv2.waitKey(0)
threshold_img  = cv2.bitwise_not(threshold_img)
cv2.imshow('Invert', threshold_img)            
cv2.waitKey(0)
#kernel = np.ones((1, 1), np.uint8)   
#threshold_img = cv2.dilate(threshold_img, kernel)  
#cv2.imshow('Dilate', threshold_img)            
#cv2.waitKey(0)
cv2.imshow('threshold image', threshold_img)
# Maintain output window until user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()
# now feeding image to tesseract
text = pytesseract.image_to_string(threshold_img)
print(text)
Twenkid
  • Hi, thank you for your help. Your solution outputs: "Si", "PIRATOLA PIELIGNA". What more do we need to get the desired result? – Matteo Dec 13 '20 at 18:55
  • Try without the invert: threshold_img = cv2.bitwise_not(threshold_img). I don't know how Tesseract deals with outlined letters like these; I guess it prefers filled ones, and you may try resizing back. Also, if the right side is correctly recognized from the beginning, the additional recognition should be applied only to the left side, to the specific regions (try findContours also). The "kitchens" part could be cropped and/or recognized separately, without these processings, in order not to mess with the caption above it. I think the letters are just too small and too close to the caption above. – Twenkid Dec 13 '20 at 20:50
  • Hi, I managed to transform the image so that the OCR recognizes "Cucine", but only if I feed it a cropped image of the box with "LUBE" (which is then not recognized). If the whole image is provided, it still doesn't notice the left side. https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/blob/master/OCR_Tesseract/logo_without_threshold_median_etc.png The code: https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/blob/master/OCR_Tesseract/ocr2_cucine.py So it may need multipass recognition. – Twenkid Dec 13 '20 at 21:53
  • Thank you a lot @Todor for your help. I also tried giving a particular configuration to image_to_string, for instance: `custom_config = r'--oem 3 --psm 11'; text = pytesseract.image_to_string(threshold_img, config=custom_config)`, and it seems to search for more words. What do you think - could that configuration solve the problem? – Matteo Dec 13 '20 at 22:10
  • You're welcome. Sorry, I am not familiar with those configurations... I added "ita" as the language, but I think it didn't help here (maybe it would help for longer texts, to fix typos etc.): text = pytesseract.image_to_string(g, lang="ita"). I think what may help is to segment the image into regions (findContours, or via the output of "data") and query them one by one. Also, separating the "kitchens" part. I managed to recognize some letters when cropping it manually, but not completely. (The config you gave didn't make a difference.) EITEHENS RIT CHEN RITOCHEWAN ... – Twenkid Dec 13 '20 at 22:59
  • More playing: I managed to get "kitchens" (from a crop), but with automatic corrections that take similar symbols into account; it could be extended with a dictionary - comparing the recognized words against it and selecting the most similar one. I also tried "tessdata_best": https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata but it recognizes phantom symbols "RM": KITCHERMNS. See: https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/tree/master/OCR_Tesseract ocr5_corrections.py kitchens2_ocr5.png kITCHEWS, KITCHEN §& CORRECTED: KITCHENS – Twenkid Dec 14 '20 at 02:20
  • As you said, I think the best solution is to divide the photo into 4 parts and analyze them separately, but I don't know how to crop an image into 4 regions and analyze them separately. Do you have any idea? – Matteo Dec 14 '20 at 13:17
  • Yes, one thing to use is (OpenCV 4 syntax): cnts, hier = cv2.findContours(image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE); for cnt in cnts: x, y, w, h = cv2.boundingRect(cnt); cv2.rectangle(image, (x, y), (x+w, y+h), (0, 255, 0), 1) # w - width, h - height. To get a region into a new image: crop = image[y:y+h, x:x+w]. See ocr6_partition.py: https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/blob/master/OCR_Tesseract/ocr6_partition.py (a runnable sketch follows below). The automatic segmentation of "kitchens" needs more work. – Twenkid Dec 14 '20 at 13:54
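
For completeness, a runnable sketch of the partition approach from the last comment (the dilation kernel and the minimum region size are guesses, not values from ocr6_partition.py):

    import cv2
    import numpy as np
    import pytesseract

    image = cv2.imread("logo.png")  # assumed local copy of the logo
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

    # dilate horizontally so the letters of a word merge into one blob
    merged = cv2.dilate(binary, np.ones((3, 15), np.uint8))

    cnts, hier = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for cnt in cnts:
        x, y, w, h = cv2.boundingRect(cnt)
        if w < 15 or h < 8:  # skip tiny specks
            continue
        crop = image[y:y + h, x:x + w]  # cut the region into its own image
        text = pytesseract.image_to_string(crop, config="--psm 7").strip()
        if text:
            print((x, y, w, h), text)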