Extracting data from Invoices in pdf or image format

Question

I am working on an invoice parser that extracts data from invoices in PDF or image format. It works on simple PDFs with non-tabular data, but for PDFs that contain tables it produces a lot of output that then has to be processed. I have not been able to find a working generic solution. I have tried the following libraries:

Invoice2Data: It is template-based and has given fairly good results in JSON format so far. However, creating templates for complex PDFs that contain dynamic tables is difficult.

Tabula: Table extraction is based on the coordinates of the table to be extracted. When a table has more rows, it grows longer and its coordinates change, so fixed coordinates give wrong results.

Pdftotext: It converts any PDF to text, but in a layout that needs a lot of parsing, which we want to avoid.

Aws_Textract and Elis_Rossum_Ai: Both return all the data in JSON format. However, when a table cell spans multiple lines, parsing the JSON becomes difficult, and the JSON itself is very large.

Tesseract: Same problem as pdftotext. Complex PDFs cannot be parsed.

Has anyone been able to parse complex PDF data, either with a different tool or with a combination of the libraries above? Any help is appreciated.

Did you try to open the PDF with MS Word, save it to xml, and then parse it? — RobertBaron, Jun 02 '19 at 11:05

Yashraj Nigam · Answer 1 · 2020-08-24T10:30:54.080

I am working on a similar business problem. Since invoices don't have a fixed format, you can't directly use a text-parsing method.

To solve this problem, you have to use computer vision (deep learning) for field detection and Pytesseract OCR to convert the detected image regions into text. For better understanding, here are the steps:

1. Convert the invoices to images and annotate the images with fields such as address, amount, etc., using a tool like labelImg. For better results, use 500-1000 invoices of different types.
2. After generating the XML annotation files, train any object detection model, such as YOLO or the TF Object Detection API.
3. The model will detect the fields and give you the coordinates of each Region Of Interest (ROI).
4. Apply Pytesseract OCR to the ROI coordinates.
5. Finally, use regex to validate the text in each extracted field and perform any manipulation or transformation that is necessary. At last, store the data in a CSV file or a database.
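
The OCR, regex validation, and CSV steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested production pipeline: it assumes the object detector has already returned one pixel box per field, and the names `detections`, `FIELD_PATTERNS`, `ocr_region`, `clean_field`, `extract_invoice` and `rows_to_csv` are hypothetical. The two regex patterns are examples for one amount and one date format and would need adjusting to your invoices.

```python
import csv
import io
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Example patterns only (assumption): adapt them to the formats you expect.
FIELD_PATTERNS = {
    "amount": re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}"),
    "date": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"),
}


def ocr_region(image_path: str, box: Box) -> str:
    """Crop one detected ROI from the invoice image and run OCR on it."""
    # Imported here so the validation helpers work without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        roi = img.crop(box)
        # --psm 6 tells Tesseract to treat the crop as one block of text.
        return pytesseract.image_to_string(roi, config="--psm 6")


def clean_field(label: str, raw_text: str) -> Optional[str]:
    """Validate OCR text for a field; return None if nothing valid is found."""
    text = " ".join(raw_text.split())  # collapse OCR line breaks and spaces
    pattern = FIELD_PATTERNS.get(label)
    if pattern is None:
        return text or None  # no pattern defined: keep the cleaned text
    match = pattern.search(text)
    return match.group(0) if match else None


def extract_invoice(image_path: str, detections: Dict[str, Box]) -> Dict[str, Optional[str]]:
    """Run OCR and validation for every field box returned by the detector."""
    return {
        label: clean_field(label, ocr_region(image_path, box))
        for label, box in detections.items()
    }


def rows_to_csv(rows: List[Dict[str, Optional[str]]], fieldnames: List[str]) -> str:
    """Serialize the validated rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A call might look like `extract_invoice("invoice.png", {"amount": (400, 900, 700, 950)})`, where the file name and box are made up. A field that comes back as `None` failed validation and is a candidate for manual review.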

Hope my answer helps you! Upvote the answer so it reaches more people.

Please share a sample script for the above steps so I can explore this further. — Manz, Oct 13 '20 at 18:35
Hope it helps: [Medium](https://medium.com/@vigneshgig/how-to-extract-the-structure-of-invoice-data-using-tensorflow-api-faster-crnn-object-detection-8aa15c12bb46) . — Yashraj Nigam, Oct 14 '20 at 10:49
