
I get this cropped image from my PDF: (image: cropped, speckled image after PDF-to-JPEG conversion)

After preprocessing, this is how I feed it to Tesseract OCR:

text = pytesseract.image_to_string(img, lang='eng')

But the OCR'd text is empty.

Edit:

I load the full image and crop it to this. Once it is cropped, I apply a sharpening filter and then remove salt-and-pepper noise:

import cv2
import easyocr
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("../data/2.pdf", fmt='JPEG',
                          poppler_path=r"D:\poppler-0.68.0\bin")

reader = easyocr.Reader(['en'])  # need to run only once to load model into memory
for page in pages:
    page.save('image.jpg', 'JPEG')
    image = cv2.imread('image.jpg')

    img = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # crop to the bounding box: cord = (x1, x2, y1, y2)
    img = img[cord[2]:cord[3], cord[0]:cord[1]]
    # sharpen, then remove salt-and-pepper noise with a median filter
    kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
    img = cv2.filter2D(img, -1, kernel)
    img = cv2.medianBlur(img, 3)
    text = pytesseract.image_to_string(img)

This image is part of a PDF. The PDF is converted to JPEG, loaded again, and then this section is cropped out by giving bounding-box coordinates.

Edit: Using the example below, this is the output after preprocessing: (image: preprocessed image after the suggested answer)

But the OCR'd text it prints is still off:

AQ@O FCI
    This question is being discussed on [Meta Stack Overflow](https://meta.stackoverflow.com/q/404771/215552) cc @GinoMempin – Heretic Monkey Jan 27 '21 at 15:26
  • Your code would get you an IndentationError - so that's _not_ what you are running. – Patrick Artner Jan 29 '21 at 07:25
  • This is not really a good question for SO. Your code _works_ - it's just that the OCR done by tesseract is not "up to par" with what your brain can do - big surprise. The choice of preprocessing to be done is highly input-dependent - what works for one image may or may not work for others. Discussing things to do for preprocessing is more of a tutorial than something that can/should be done here. I suggest researching image preprocessing methods - there are even older posts that do that: [tesseract ocr](https://stackoverflow.com/questions/28935983/preprocessing-image-for-tesseract-ocr-with-opencv) – Patrick Artner Jan 29 '21 at 07:31
  • Some other OCR questions that might help you out: https://stackoverflow.com/questions/54940022/opencv-image-transformation-for-tesseract-ocr , https://stackoverflow.com/questions/60624019/how-to-improve-ocr-with-pytesseract-text-recognition , https://stackoverflow.com/questions/64099248/pytesseract-improve-ocr-accuracy , https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy , ... (google with: _ocr dilatate erode improve site:stackoverflow.com_) – Patrick Artner Jan 29 '21 at 07:40
  • Edited: included the speckled image back in (without it this Q makes no sense) and fixed the IndentationError (earlier revisions did not have it - so probably caused by editing) – Patrick Artner Jan 29 '21 at 09:02

1 Answer


I have a two-step solution:

    1. Apply dilation followed by erosion (closing).
    2. Apply adaptive thresholding.

Now why do we apply dilation followed by erosion?

As we can see, the input image contains artifacts around each character. Applying a closing operation will reduce these artifacts.

(image: result after closing)

The artifacts are reduced but not completely gone. Therefore, if we apply adaptive thresholding, the result will be:

(image: result after adaptive thresholding)

Now the image is suitable for reading:

AOF CIF

Code:


import cv2
from pytesseract import image_to_string

img = cv2.imread("7UGLJ.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upsample 2x so the character strokes survive the closing
(h, w) = gry.shape[:2]
gry = cv2.resize(gry, (w*2, h*2))

# Closing (dilation then erosion) with the default 3x3 kernel
cls = cv2.morphologyEx(gry, cv2.MORPH_CLOSE, None)

# Adaptive mean thresholding: 41x41 neighborhood, offset 10
thr = cv2.adaptiveThreshold(cls, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 41, 10)

txt = image_to_string(thr)
print(txt)