I am trying to extract text from a .jpg file using pytesseract, but only part of the text is extracted, and even that has spelling mistakes. Could anyone suggest how I can extract the full text? I have attached the .jpg for reference, along with the code snippet I am using for text extraction.

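from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
# Open the image and run OCR with the English language model
img = Image.open('page-594-5.jpg')
text = pytesseract.image_to_string(img, lang='eng')

print(text)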

Output I am getting: [image]

Image from which the data needs to be extracted: [image]

bad_coder

1 Answer

Pytesseract has its limitations, even on printed text. To improve your results, you can try some of the solutions below:

1) Make each text character around 12 pt in size (font size).

2) Set the image resolution to around 300 DPI (see "setting image resolution").

3) Denoise the image (see "denoising image using python").
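The three suggestions above could be combined into a small preprocessing step before calling pytesseract. The following is only a rough sketch, not a guaranteed fix: the helper names `preprocess_for_ocr` and `ocr_image`, the 2x scale factor, and the 3x3 median filter are illustrative assumptions, and the `--dpi 300` option assumes Tesseract 4 or later. The right values depend on the actual image.

```python
from PIL import Image, ImageFilter


def preprocess_for_ocr(img, scale=2.0, median_size=3):
    """Grayscale, upscale, and median-filter an image before OCR.

    scale and median_size are illustrative defaults (assumptions),
    not values recommended by Tesseract itself.
    """
    gray = img.convert("L")  # drop color information
    new_size = (int(gray.width * scale), int(gray.height * scale))
    enlarged = gray.resize(new_size, Image.LANCZOS)  # make characters larger
    # A median filter removes isolated speckle noise
    return enlarged.filter(ImageFilter.MedianFilter(size=median_size))


def ocr_image(path, scale=2.0):
    """Hypothetical helper: preprocess the file at `path`, then run OCR."""
    import pytesseract  # imported here so preprocessing works without it

    with Image.open(path) as img:
        prepared = preprocess_for_ocr(img, scale=scale)
    # Tell Tesseract to treat the input as a 300 DPI image (assumes Tesseract 4+)
    return pytesseract.image_to_string(prepared, lang="eng", config="--dpi 300")
```

With the file from the question, this might be called as `print(ocr_image('page-594-5.jpg'))`, after setting `pytesseract.pytesseract.tesseract_cmd` as in the original snippet. If the text is already large, a smaller `scale` may work better, and a median filter that is too strong can blur thin strokes.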

Mehul Gupta