
I'm guessing this is because the images I have contain text on top of a picture. pytesseract.image_to_string() can usually scan the text properly, but it also returns a crap ton of gibberish characters. I think the pictures underneath the text make Pytesseract treat them as text too, or something.

When Pytesseract returns a string, how can I make it so that it doesn't include any text unless it's certain that the text is right? Like, is there a way for Pytesseract to also return some sort of number telling me how certain it is that the text was scanned accurately?

I know I kinda sound dumb but somebody pls help

Charlie
    Where is (1) your code and (2) your images? Stack Overflow is not a free code writing service. You are expected to try to write the code yourself. After doing [more research](http://meta.stackoverflow.com/questions/261592) if you have a problem you can post what you've tried with a clear explanation of what isn't working and providing a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). I suggest reading [How to Ask a good question](https://stackoverflow.com/questions/how-to-ask). Also, be sure to [take the tour](https://stackoverflow.com/tour) – bfris Aug 21 '21 at 00:17

1 Answer


You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm (page segmentation mode) options to get a better result.

Unfortunately, it is not that easy. I think the only way is to apply some image preprocessing, and this is my best attempt:
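As a rough sketch of what that config string could look like: the helper name `build_tesseract_config` below is made up for illustration, while `--psm` and `-c tessedit_char_whitelist=...` are standard Tesseract options. How well the whitelist is honoured seems to vary between Tesseract versions and engines, so treat it as something to experiment with rather than a guaranteed fix.

```python
import string


def build_tesseract_config(psm=6, whitelist=None):
    """Build a string for pytesseract's `config` argument (hypothetical helper)."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # tessedit_char_whitelist limits the characters Tesseract may output
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Allow only ASCII letters and digits, with page segmentation mode 6
config = build_tesseract_config(psm=6, whitelist=string.ascii_letters + string.digits)
print(config)

# Usage (needs Tesseract and pytesseract installed; `img` is your loaded image):
#   import pytesseract
#   txt = pytesseract.image_to_string(img, config=config)
```

Note that a whitelist of letters and digits will not remove gibberish that is itself made of letters, so it only helps with stray symbols.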

  1. First, I applied some blurring to smooth the image:

         import cv2

         img = cv2.imread("image.png")  # placeholder path: load your own image here
         blurred = cv2.blur(img, (5, 5))

  2. Then, to remove everything except the text, I converted the image to grayscale and applied thresholding to keep only white, which is the text color. I used inverse thresholding to make the text black, which is the optimum condition for Tesseract OCR:

         gray_blurred = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
         # pixels brighter than 239 (the near-white text) become black, everything else white
         ret, th1 = cv2.threshold(gray_blurred, 239, 255, cv2.THRESH_BINARY_INV)


Then I applied OCR and removed the whitespace characters:

    import pytesseract

    txt = pytesseract.image_to_string(th1, lang='eng', config='--psm 12')
    # join the lines and drop the form feed character at the end of the output
    txt = txt.replace("\n", " ").replace("\x0c", "")
    print(txt)
    # output: "WINNING'OLYMPIC  GOLD MEDAL  IT'S MADE OUT OF  RECYCLED ELECTRONICS "
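On the "some sort of number" part of the question: pytesseract also has `image_to_data()`, which returns a per-word confidence (`conf`) next to each piece of text. Here is a minimal sketch of filtering on it. The function names and the threshold of 60 are my own assumptions, not something tuned on the example image, and I have not checked how well it separates real text from gibberish on memes.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence is at least min_conf.

    `data` is the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        try:
            score = float(conf)  # conf may come back as int, float, or str
        except (TypeError, ValueError):
            continue
        # rows that are not words have empty text and a confidence of -1
        if text.strip() and score >= min_conf:
            words.append(text.strip())
    return words


def ocr_confident_text(image, min_conf=60.0, config="--psm 12"):
    """Run OCR and return only the confident words (needs Tesseract installed)."""
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, lang="eng", config=config, output_type=Output.DICT
    )
    return " ".join(confident_words(data, min_conf))
```

With the thresholded image from above, that would be something like `ocr_confident_text(th1)`. Raising `min_conf` drops more gibberish but can also drop real words that were scanned poorly.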

Related topics:

Pytesser set character whitelist

Pytesseract OCR multiple config options

You can also try preprocessing your image to help pytesseract work more accurately, and if you only want to recognize meaningful words, you can apply a spell check after OCR:

https://pypi.org/project/pyspellchecker/
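A rough sketch of that spell check idea, assuming pyspellchecker's `SpellChecker().known()` is used as the dictionary lookup. The function `keep_known_words` and its `known_words` parameter are hypothetical names for illustration; passing your own word set skips the library entirely.

```python
import re


def keep_known_words(text, known_words=None):
    """Drop tokens that are not dictionary words (hypothetical helper)."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if known_words is None:
        # imported lazily so the function also works with a custom word set
        from spellchecker import SpellChecker

        known_words = SpellChecker().known(tokens)
    # compare in lower case so "GOLD" matches "gold"
    known_lower = {w.lower() for w in known_words}
    return " ".join(t for t in tokens if t.lower() in known_lower)


# Example with a hand-made word set instead of the real dictionary
print(keep_known_words("GOLD MEDAL brfxxcmmdwmnlqw", known_words={"gold", "medal"}))
```

This removes strings like "brfxxcmmdwmnlqw", but short gibberish that happens to be a real word will still get through, and names or slang that are missing from the dictionary will be dropped.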

Bhargav Rao
  • However, when I say gibberish characters, it's usually a bunch of gibberish alphanumeric characters. Is there any way I can have it get rid of characters like "brfxxcmmdwmnlqw"? – Charlie Aug 20 '21 at 21:46
  • Can you show an example of an image that you apply OCR to? – cagataygulten Aug 20 '21 at 22:04
  • Here is an example. I downloaded a bunch of meme images from reddit and I wanna extract text from them. I don't know why they return the text along with random gibberish letters: https://ibb.co/b78qW2x – Charlie Aug 20 '21 at 23:01