
I am working on an OCR problem for bank receipts, and I need to extract details like the date and account number. After preprocessing the input, I am using Tesseract-OCR (via pytesseract in Python). I have obtained the hOCR output file, but I am not able to make sense of it. How do I extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, like on normal forms.

I used the code below to read the file. Should I use a different encoding?

import os

# Read the hOCR output produced by Tesseract
if os.path.isfile('output.hocr'):
    with open('output.hocr', 'r', encoding='UTF-8') as fp:
        text = fp.read()

Note: The attached image is one example of the data. These images come from PDF files, which I am converting into images programmatically.
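
For reference, the hOCR file Tesseract produces is XHTML: each recognized word is a <span class="ocrx_word"> element whose title attribute carries its bounding box and confidence (e.g. "bbox 36 92 96 116; x_wconf 95"), so it can be read with any HTML parser. A minimal sketch of that, assuming BeautifulSoup (bs4) is available:

from bs4 import BeautifulSoup

# Parse the hOCR output as HTML
with open('output.hocr', 'r', encoding='UTF-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

# Each word span has a title like "bbox 100 200 150 220; x_wconf 95"
for word in soup.find_all('span', class_='ocrx_word'):
    bbox = word.get('title', '').split(';')[0].replace('bbox ', '').split()
    print(word.get_text(strip=True), [int(v) for v in bbox])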

2 Answers


You can simply provide the image as input to pytesseract directly, instead of creating and then parsing an hOCR output file.

Try:

from PIL import Image
import pytesseract

# Open the receipt image and run Tesseract OCR on it
im = Image.open("receipt.jpg")

text = pytesseract.image_to_string(im, lang='eng')

print(text)

This program opens the image you want to run through OCR, extracts its text, stores it in the variable text, and prints it out. If you want, you can also write the contents of text to a separate file.

P.S.: The image that you are trying to process is far more complex than the images Tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing it before use, for example by restricting the character set, preprocessing the image before passing it to OCR, upsampling it, and keeping the DPI above 250.
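
A minimal sketch of that kind of preprocessing, assuming OpenCV is available (the filename, the 2x upscale, the Otsu threshold and the digits-only whitelist are illustrative choices, not part of the original answer):

import cv2
import pytesseract

# Load the receipt in grayscale (filename is illustrative)
img = cv2.imread("receipt.jpg", cv2.IMREAD_GRAYSCALE)

# Upsample so the small digits are easier for Tesseract to resolve
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# Binarize with Otsu's threshold to suppress background noise
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Restrict the character set to digits, since the target fields are numeric
config = "--psm 6 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(img, lang='eng', config=config))

Note that the character whitelist is honoured reliably by the legacy Tesseract engine; with the newer LSTM engine it may be ignored, depending on the version.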

Vasu Deo.S
  • Can you share some links or examples for the same? – Shrinidhi Narasimhan May 31 '19 at 12:59
  • [Read this link](https://stackoverflow.com/questions/56303292/identify-clear-text-from-image-python/56303477#56303477); it contains methods you could use to further improve the results. – Vasu Deo.S May 31 '19 at 14:12
  • What about the handwritten characters? I mean, since this is a form and I need to extract handwritten dates and account numbers, do you think any additional training is required? – Shrinidhi Narasimhan Jun 01 '19 at 06:46
  • You can use `opencv`, and train it to deal with specifically your type of image – Vasu Deo.S Jun 01 '19 at 06:55
  • I tried using edge detection and applying the pytesseract.image_to_string() method, but I am getting the same results. I need to figure out a way to segment the image and retrieve text from only within the boxes of the form. – Shrinidhi Narasimhan Jun 01 '19 at 09:14
  • Read my answer again; I have already stated it all in the P.S. – Vasu Deo.S Jun 01 '19 at 09:25

I personally would use something like Tesseract to do the OCR, and then perhaps something like OpenCV with SURF for the tick boxes...

Or even do edge detection with OpenCV and SURF for each section, and OCR that specific area; analyzing a specific region rather than the whole document makes it more robust.
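
A rough sketch of that idea, using plain contour detection in OpenCV 4 to locate the form boxes and OCR each one separately (the filename, threshold choices and minimum box size are illustrative assumptions; SURF itself lives in opencv-contrib and is not used here):

import cv2
import pytesseract

# Load the form and binarize it (filename is illustrative)
img = cv2.imread("receipt.jpg", cv2.IMREAD_GRAYSCALE)
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find external contours that are large enough to be form boxes
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 40 and h > 20:  # illustrative minimum box size
        roi = img[y:y + h, x:x + w]
        # --psm 7 treats each crop as a single line of text
        text = pytesseract.image_to_string(roi, config="--psm 7")
        print((x, y, w, h), text.strip())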

VeNoMouS