I want to extract a number from an image using Tesseract OCR with Python, but Tesseract is not recognizing the number correctly. The images have the following format: (image)

The text is in Arial at font size 80. This is the code I am using:

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
def process_image(image_name, lang_code):
    return pytesseract.image_to_string(Image.open(image_name), lang=lang_code)

def print_data(data):
    print(data)

def main():
    data_eng = process_image("test.jpg", "eng")
    print_data(data_eng)

if __name__ == '__main__':
    main()

With this code, Tesseract is not able to detect the number. I need to extract the number from around 200,000 images, so it would be really helpful if someone could suggest a workaround. Any help is appreciated.

Thanks in Advance

2 Answers

This should work.

import pytesseract
from PIL import Image

# Raw string so the backslashes in the Windows path are not treated as escapes.
pytesseract.pytesseract.tesseract_cmd = r"C:\Tesseract-OCR\tesseract.exe"
def process_image(image_name, lang_code):
    return pytesseract.image_to_string(Image.open(image_name), lang=lang_code)

def print_data(data):
    print(data)

def main():
    
    data_eng = pytesseract.image_to_string(Image.open('test.jpg'), config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789.')
    print(data_eng)

if __name__ == '__main__':
    main()

Here is the result:

1640778161.134756

Process finished with exit code 0

This is not guaranteed to give accurate results for all of your images. If the images are not clear, you may need to preprocess them first. See the pytesseract questions on Stack Overflow for further study.
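
As a rough illustration of the preprocessing step mentioned above, here is a minimal sketch using Pillow only. The function names `preprocess_for_ocr` and `read_number`, the scale factor, the threshold of 128, and the use of `--psm 7` (single text line) are all assumptions for illustration, not something tested on your images, so you will probably need to tune them.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=128):
    """Return an upscaled, black-and-white copy of a PIL image (hypothetical helper)."""
    gray = ImageOps.grayscale(image)        # drop colour information
    gray = ImageOps.autocontrast(gray)      # stretch the contrast to the full 0-255 range
    width, height = gray.size
    enlarged = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Pixels at or above the threshold become white (255), everything else black (0).
    return enlarged.point(lambda p: 255 if p >= threshold else 0)


def read_number(path):
    """Preprocess one image file and run Tesseract on it (hypothetical helper)."""
    import pytesseract  # imported here so the preprocessing can be tried without Tesseract

    with Image.open(path) as img:
        cleaned = preprocess_for_ocr(img)
    # Assumption: the number sits on one line, hence --psm 7 instead of --psm 10.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789."
    return pytesseract.image_to_string(cleaned, config=config).strip()
```

You would call `read_number("test.jpg")` in place of the direct `image_to_string` call. It is worth saving a few of the preprocessed images to disk and checking them by eye before running this over the whole set.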

Anand Gautam

Alternatively, you can use EasyOCR:

import easyocr

# English model on the CPU; the allowlist restricts recognition to digits.
reader = easyocr.Reader(['en'], gpu=False)
resu = reader.readtext(
    'foo.png',
    allowlist='0123456789'
)
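
By default `readtext` returns a list of (bounding box, text, confidence) entries rather than a plain string, so the number still has to be pulled out of the result. A minimal sketch of that step follows; the helper name `digits_from_results` and the 0.5 confidence cutoff are assumptions for illustration.

```python
def digits_from_results(results, min_confidence=0.5):
    """Join the digits from EasyOCR-style (bbox, text, confidence) entries.

    Hypothetical helper: entries below min_confidence are skipped.
    """
    parts = []
    for _bbox, text, confidence in results:
        if confidence >= min_confidence:
            # Keep only ASCII digits, dropping spaces or stray characters.
            parts.append("".join(ch for ch in text if ch in "0123456789"))
    return "".join(parts)
```

With the snippet above this would be used as `number = digits_from_results(resu)`. An empty string means nothing was read with enough confidence, which is a useful signal for setting an image aside for manual review.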
lam vu Nguyen