I'm new to the OCR world and I have documents with numbers to analyse with Python, OpenCV and pytesseract. The files I received are PDFs and the numbers are not text, so I converted the first page to JPG with this:

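from pdf2image import convert_from_path

# Render only page 1 of the PDF at 600 dpi and save it as a JPEG
first_page = convert_from_path(path__to_pdf, dpi=600, first_page=1, last_page=1)
first_page[0].save(TEMP_FOLDER+'temp.jpg', 'JPEG')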

The images then look like this; I still have some noise around the digits.

enter image description here

I tried to keep only the black pixels with this:

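import cv2
import numpy as np

# img_raw is assumed to be a BGR image, as returned by cv2.imread
img_hsv = cv2.cvtColor(img_raw, cv2.COLOR_BGR2HSV)
img_changing = cv2.cvtColor(img_raw, cv2.COLOR_BGR2GRAY)

# Accept any hue and saturation, but only low values (dark pixels)
low_color = np.array([0, 0, 0])
high_color = np.array([180, 255, 30])
blackColorMask = cv2.inRange(img_hsv, low_color, high_color)

# Invert, keep only the masked (dark) pixels, then invert back to black on white
img_inversion = cv2.bitwise_not(img_changing)
img_black_filtered = cv2.bitwise_and(img_inversion, img_inversion, mask=blackColorMask)
img_final_inversion = cv2.bitwise_not(img_black_filtered)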

With this code, my image looks like this:

enter image description here

Even with cv2.blur, fewer than 75% of the images are fully recognised: for at least 25% of them, pytesseract misses one or more digits. Is that normal? Do you have any ideas for how I can maximise the success rate?

Thanks

Christoph Rackwitz
  • This is a highly rated answer on digit recognition: https://stackoverflow.com/a/9620295/18667225 Does it help you? – Markus Nov 13 '22 at 18:32
  • Thank you KJ and Markus. I finally managed to get the numbers by analysing a small area on the side of the document where they had been secretly placed: the same numbers, but in a different font. So I now have a 100% success rate! – Romain Che Nov 13 '22 at 20:48

2 Answers

Whenever you see that Tesseract is missing a character or digit, think about page segmentation modes. If the character is not correct but was read, it is a recognition issue.

OCR engines first split the text in the input image into regions, a step called page segmentation, and then try to recognize the text in each region. Tesseract supports 14 page segmentation modes, numbered 0 to 13:

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

For your case, the best option is to treat the image as a single uniform block of text (mode 6) so that no digits are skipped, and then restrict the output to digits only to get a better result. Your code should look like this:

import pytesseract

# --psm 6: single uniform block of text; the whitelist limits output to digits
text = pytesseract.image_to_string(image, lang='eng',
                                   config='--psm 6 -c tessedit_char_whitelist=0123456789')
print(text)

Output:

1821293045013
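
If mode 6 still drops digits on some images, it may be worth comparing a few segmentation modes side by side. The following is only a sketch: the helper names `digits_config` and `compare_psm_modes` and the default list of modes are my own assumptions, not part of pytesseract. It uses only `pytesseract.image_to_string` with the same `config` string as above.

```python
def digits_config(psm: int) -> str:
    """Build a digits-only Tesseract config string for one segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def compare_psm_modes(image, modes=(6, 7, 8, 11, 13), ocr=None):
    """Run OCR once per mode and return {mode: recognised text}.

    `ocr` defaults to pytesseract.image_to_string; passing another callable
    with the same keyword arguments makes the helper easy to try out.
    """
    if ocr is None:
        import pytesseract  # imported lazily so digits_config works without it
        ocr = pytesseract.image_to_string
    return {
        psm: ocr(image, lang="eng", config=digits_config(psm)).strip()
        for psm in modes
    }
```

Calling `compare_psm_modes(image)` on a handful of the failing images and printing the resulting dictionary should show quickly whether a line-oriented mode such as 7 or 13 keeps digits that mode 6 loses.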
Esraa Abdelmaksoud
  • I've used his preprocessed image. He can apply his image preprocessing to the color image then this code. It will work the same way. – Esraa Abdelmaksoud Nov 15 '22 at 00:18
  • @K J This might be helpful for you regarding the term preprocessed. https://nextgeninvent.com/7-steps-of-image-pre-processing-to-improve-ocr-using-python/ My answer to the person who posted the question was to solve the original problem of missing the characters. You're free of course to post yours. All we care about is solving problems, and he is the one who decides which answer is more helpful. No need for excessive usage of bold font and exclamation marks. :) – Esraa Abdelmaksoud Nov 15 '22 at 01:24

Your attempt to process a field entry was thwarted by "artifacts". See the upper pair for my best result with your coloured source.

enter image description here

The usual advice is to use greyscale, but in this case that makes matters worse because there is background chatter.

enter image description here

You were right to attempt thresholding, as that will produce clearer results. However, Tesseract is prone to inserting odd line breaks and white space when the characters are not words.

enter image description here

I suggested that you double-check whether there was any vector data in the file, and it appears you uncovered an entry (an annotation?) that matched the data field.

K J