I am currently building an OCR application. Extracting fixed areas based on a predefined template works fine, but I am having difficulty extracting the line items from scanned invoices, because every invoice has a different set of line items.
1 Answer
It sounds like you are looking into dynamically extracting information from unstructured forms.
The term 'Unstructured forms processing' refers to capturing data from documents that do not have a fixed structure. Examples of unstructured forms are documents such as purchase orders, invoices, bills, and tabs. These types of documents have a general template but certain parts of the form can vary depending on how many line items or purchases are included in the form.
To extract the data from the form, you will need some form of OCR to convert the image to text. If you are looking for an open source solution, you can use Tesseract to extract all of the text from the invoice and then pick the line items out of that text. I searched Stack Overflow for using Tesseract on unstructured forms and came across these questions, which you can take a look at:
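As a rough illustration of the "OCR everything, then pick out the line items" approach, here is a minimal Python sketch. It assumes the `pytesseract` wrapper and Pillow are installed alongside the Tesseract binary, and it assumes a hypothetical row layout of description, quantity, unit price, and line total. The function names, the regular expression, and the sample text are all made up for illustration; real invoices will need their own pattern and probably some image preprocessing.

```python
import re
from decimal import Decimal

# Assumed (hypothetical) row layout: description, quantity, unit price, total.
# Example row: "Blue gadget, large 1 1,250.50 1,250.50"
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"\$?(?P<unit>\d[\d,]*\.\d{2})\s+"
    r"\$?(?P<total>\d[\d,]*\.\d{2})$"
)


def to_decimal(amount):
    """Convert a string such as '1,250.50' to a Decimal."""
    return Decimal(amount.replace(",", ""))


def parse_line_items(ocr_text):
    """Return one dict per OCR text line that matches the assumed row layout."""
    items = []
    for raw in ocr_text.splitlines():
        match = LINE_ITEM.match(raw.strip())
        if match is None:
            continue  # header, footer, totals, or noise
        items.append({
            "description": match.group("description"),
            "quantity": int(match.group("qty")),
            "unit_price": to_decimal(match.group("unit")),
            "total": to_decimal(match.group("total")),
        })
    return items


def ocr_image(path):
    """Run Tesseract on an image file and return the recognized text.

    Imports are local so the parser above can be used without OCR installed.
    """
    import pytesseract
    from PIL import Image

    # "--psm 6" asks Tesseract to treat the page as one uniform block of text,
    # which tends to keep table rows on a single line.
    return pytesseract.image_to_string(Image.open(path), config="--psm 6")


# Made-up OCR output used only to demonstrate the parser.
SAMPLE_TEXT = """ACME Supplies Ltd.
Invoice No: 1001
Description Qty Unit Price Amount
Widget A 2 10.00 20.00
Blue gadget, large 1 1,250.50 1,250.50
Subtotal 1,270.50
"""
```

With a real scan you would call `parse_line_items(ocr_image("invoice.png"))`, where the file name is a placeholder. Because the number of rows is not fixed, the parser simply keeps every line that matches the pattern instead of reading from predefined coordinates.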
Tesseract receipt scanning advice needed
How to extract relevant information from receipt
Another option is to look into a commercial solution that provides libraries to solve this issue for you. The company I work for, LEADTOOLS, has an Invoice Recognition and Processing library that allows you to define your Master invoice and then process your filled invoices against it. LEADTOOLS also provides a video overview of the Invoice Recognition and Processing SDK.
