I am working on invoice parser which extracts data from invoices in pdf or image format.It works on simple pdf with non tabular data but gives lots of output data to process with pdf which contains tables.I am not able to get a working generic solution for this.I have tried the following libraries
Invoice2Data : It is based on templates.It has given fairly good results in json format till now.But Template creation for complex pdfs containing dynamic table is complex.
Tabula : Table extraction is based on coordinates of the table to be extracted.If the data in the table increases the table length increases and hence the coordinates changes.So in this case it gives wrong results.
Pdftotext : It converts any pdfs to text but with the format that needs lots of parsing which we do not want.
Aws_Textract and Elis_Rossum_Ai : Gives all the data in json format.But if the table column contains multiple line then json parsing becomes difficult.Even the json given is huge in size to parse.
Tesseract : Same as pdftotext.Complex pdfs are not parseable.
Other than all this or with combination of the above libraries has anyone been able to parse complex pdf data please help.