Python PyTesseract Module returning gibberish from an image

Question

I'm guessing this happens because my images contain text on top of a picture. pytesseract.image_to_string() can usually read the text properly, but it also returns a crap ton of gibberish characters. My guess is that the picture underneath the text makes Pytesseract think parts of it are text too.

When Pytesseract returns a string, how can I make it include only text that it's certain is right? Is there a way for Pytesseract to also return some sort of number telling me how confident it is that the text was scanned accurately?

I know I kinda sound dumb but somebody pls help

Where is (1) your code and (2) your images? Stack Overflow is not a free code writing service. You are expected to try to write the code yourself. After doing [more research](http://meta.stackoverflow.com/questions/261592) if you have a problem you can post what you've tried with a clear explanation of what isn't working and providing a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). I suggest reading [How to Ask a good question](https://stackoverflow.com/questions/how-to-ask). Also, be sure to [take the tour](https://stackoverflow.com/tour) — bfris, Aug 21 '21 at 00:17

score 0 · Answer 1 · edited Aug 23 '21 at 09:52

You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm (page segmentation mode) options to get a better result.
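As a rough sketch of what that config string could look like: the helper name `build_tesseract_config` and the chosen whitelist (upper-case letters and digits) are assumptions for illustration, not part of pytesseract. `tessedit_char_whitelist` is a Tesseract config variable, though how strictly it is honoured can depend on the Tesseract version and OCR engine in use.

```python
import string


def build_tesseract_config(psm=12, whitelist=None):
    """Build a string for the `config` argument of pytesseract (hypothetical helper)."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # Quotes or spaces in the whitelist would need extra escaping,
        # so this sketch sticks to plain letters and digits.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Assumed whitelist: upper-case letters and digits only.
config = build_tesseract_config(psm=12, whitelist=string.ascii_uppercase + string.digits)
print(config)

# Usage, given pytesseract and a loaded image array `image`:
# txt = pytesseract.image_to_string(image, lang="eng", config=config)
```

Trying a few psm values (for example 6, 11 and 12) with the same image and comparing the output is usually the quickest way to see which layout mode suits a picture.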

Unfortunately, it is not that easy. I think the only way is to apply some image preprocessing, and this is my best attempt:

First, I applied some blurring to smooth the image:

import cv2

img = cv2.imread("image.png")  # hypothetical path; load your own image here
blurred = cv2.blur(img, (5, 5))  # 5x5 box blur

Then, to remove everything except the text, I converted the image to grayscale and applied thresholding to keep only white, which is the text color. I used inverse thresholding to make the text black, which works best for Tesseract OCR:

gray_blurred = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
# Pixels brighter than 239 (the white text) become black, everything else white
ret, th1 = cv2.threshold(gray_blurred, 239, 255, cv2.THRESH_BINARY_INV)

Finally, I applied OCR and then removed the newline and form feed characters:

import pytesseract

txt = pytesseract.image_to_string(th1, lang='eng', config='--psm 12')
# Replace newlines with spaces and drop the trailing form feed
txt = txt.replace("\n", " ").replace("\x0c", "")
print(txt)
# Output: "WINNING'OLYMPIC  GOLD MEDAL  IT'S MADE OUT OF  RECYCLED ELECTRONICS "
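On the question of getting a certainty number: pytesseract.image_to_data() reports a per-word confidence (0 to 100, with -1 for boxes that hold no word) alongside the text. Below is a minimal sketch of filtering on it. The function names and the cutoff of 60 are assumptions to tune for your images, and the sample dict is hand-made for illustration, not real OCR output.

```python
def confident_text(data, min_conf=60.0):
    """Join the words whose confidence is at least min_conf.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists under the "text" and "conf" keys.
    """
    words = []
    for word, conf in zip(data["text"], data["conf"]):
        # conf can be a str, int or float depending on the pytesseract version.
        if word.strip() and float(conf) >= min_conf:
            words.append(word.strip())
    return " ".join(words)


def ocr_confident_text(image, min_conf=60.0, config="--psm 12"):
    """Run Tesseract on `image` and keep only the confident words."""
    # Imported here so confident_text() can be used without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, lang="eng", config=config, output_type=Output.DICT
    )
    return confident_text(data, min_conf)


# Hand-made sample in the same shape as the real dict, for illustration only.
sample = {"text": ["", "GOLD", "MEDAL", "~|;"], "conf": ["-1", "96", 91.5, 23]}
print(confident_text(sample))  # prints: GOLD MEDAL
```

With a preprocessed image such as th1, calling ocr_confident_text(th1) would return only the words Tesseract scored at or above the cutoff, which tends to drop the gibberish picked up from the background.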
