
I get this cropped image from my PDF: (image: cropped, speckled image after PDF-to-JPEG conversion)

After preprocessing, this is how I feed it to Tesseract OCR:

text = pytesseract.image_to_string(img, lang='eng')

But the OCR'd text is empty.

Edit:

I load the full image and crop it to this. Once it is cropped, I apply a sharpening filter and then remove salt-and-pepper noise:

import cv2
import easyocr
import numpy as np
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("../data/2.pdf", fmt='JPEG',
                          poppler_path=r"D:\poppler-0.68.0\bin")

reader = easyocr.Reader(['en'])  # need to run only once to load model into memory
for page in pages:
    page.save('image.jpg', 'JPEG')
    image = cv2.imread('image.jpg')

    img = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # crop to the bounding box: cord = (x1, x2, y1, y2)
    img = img[cord[2]:cord[3], cord[0]:cord[1]]
    # sharpen, then remove salt-and-pepper noise with a median filter
    kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
    img = cv2.filter2D(img, -1, kernel)
    img = cv2.medianBlur(img, 3)
    text = pytesseract.image_to_string(img)

This image is part of a PDF. The PDF is converted to JPEG, loaded again, and then this section is cropped out by giving bounding-box coordinates.

Edit: Using the example below, this is the output after preprocessing: (image: preprocessed image after the suggested answer)

But the OCR'd text it prints is still off:

AQ@O FCI
    This question is being discussed on [Meta Stack Overflow](https://meta.stackoverflow.com/q/404771/215552) cc @GinoMempin – Heretic Monkey Jan 27 '21 at 15:26
  • Your code would get you an IndentationError - so that's _not_ what you are running. – Patrick Artner Jan 29 '21 at 07:25
  • This is not really a good question for SO. Your code _works_ - it's just that the OCR done by tesseract is not "up to par" with what your brain can do - big surprise. The choice of preprocessing to be done is highly input-dependent - what works for one image may or may not work for others. Discussing things to do for preprocessing is more of a tutorial than something that can/should be done here. I suggest researching image preprocessing methods - there are even older posts that do that: [tesseract ocr](https://stackoverflow.com/questions/28935983/preprocessing-image-for-tesseract-ocr-with-opencv) – Patrick Artner Jan 29 '21 at 07:31
  • Some other OCR questions that might help you out: https://stackoverflow.com/questions/54940022/opencv-image-transformation-for-tesseract-ocr , https://stackoverflow.com/questions/60624019/how-to-improve-ocr-with-pytesseract-text-recognition , https://stackoverflow.com/questions/64099248/pytesseract-improve-ocr-accuracy , https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy , ... (google with: _ocr dilatate erode improve site:stackoverflow.com_) – Patrick Artner Jan 29 '21 at 07:40
  • Edited: included the speckled image back in (without it this Q makes no sense) and fixed the IndentationError (earlier revisions did not have it - so probably caused by editing) – Patrick Artner Jan 29 '21 at 09:02

1 Answer


I have a two-step solution:

    1. Apply dilation followed by erosion (closing).
    2. Apply adaptive thresholding.

Now why do we apply dilation followed by erosion?

As we can see, the input image contains artifacts around each character. Applying a closing operation will reduce these artifacts.

(image: result after closing)

The artifacts are reduced but not completely gone. Therefore, if we apply adaptive thresholding, the result will be:

(image: result after adaptive thresholding)

Now the image is suitable for reading:

AOF CIF

Code:


import cv2
from pytesseract import image_to_string

img = cv2.imread("7UGLJ.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upsample 2x so the character strokes survive the closing
(h, w) = gry.shape[:2]
gry = cv2.resize(gry, (w*2, h*2))

# Closing (dilation then erosion) with the default 3x3 kernel
cls = cv2.morphologyEx(gry, cv2.MORPH_CLOSE, None)

# Adaptive mean thresholding: 41x41 neighborhood, offset 10
thr = cv2.adaptiveThreshold(cls, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 41, 10)

txt = image_to_string(thr)
print(txt)