I have several kinds of PDF files that contain mixed content such as text and tables. A table may appear anywhere on a page (top, middle, or bottom). I want to extract only the table data (number of columns, number of rows, and the cell contents) from such a PDF using Java, without specifying the table's location in advance.
What I have tried so far:
1. I used the iText Java API to read the PDF and extract its content, using the following call:
PdfTextExtractor.getTextFromPage
but it only returns the content as plain text. I could not find any way to identify where a table exists in the PDF or how to extract the data from it.
2. I also tried the PDFBox Java API, but it did not solve my problem either.
3. I also followed this Stack Overflow question:
PDF table extraction
But it does not give me the expected output. That algorithm expects the line positions and similar details to be supplied.
I am not able to work out where the table is located in the PDF.
Can anybody tell me how to solve this problem using the iText or PDFBox API, or is there an open-source API that can help?
Or could the PDF be converted to HTML, so that the table can be identified and read via its table tags?
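To make the goal concrete, here is a minimal sketch of the kind of result I am after. It is not iText or PDFBox code. It assumes the page text has already been extracted as plain lines and that table columns are separated by two or more spaces, which real PDFs often do not guarantee. The class `TableGuess` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of iText or PDFBox.
class TableGuess {

    // Split one line of extracted text into cells on runs of 2+ whitespace characters.
    // Assumption: columns are separated by at least two spaces in the extracted text.
    static String[] cells(String line) {
        return line.trim().split("\\s{2,}");
    }

    // Return the longest run of consecutive lines that all split into the same
    // number (at least 2) of cells. An empty list means nothing table-like was found.
    static List<String[]> extract(List<String> lines) {
        List<String[]> best = new ArrayList<>();
        List<String[]> run = new ArrayList<>();
        for (String line : lines) {
            String[] c = cells(line);
            // A different cell count (including a plain text line) ends the current run.
            if (!run.isEmpty() && run.get(0).length != c.length) {
                run = new ArrayList<>();
            }
            if (c.length >= 2) {
                run.add(c);
                if (run.size() > best.size()) {
                    best = new ArrayList<>(run);
                }
            }
        }
        // A single multi-cell line is not enough evidence of a table.
        return best.size() >= 2 ? best : new ArrayList<>();
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of a PDF text extractor.
        List<String> page = Arrays.asList(
                "Invoice for March",
                "Item   Qty   Price",
                "Pen    2     1.50",
                "Book   1     12.00",
                "Thank you for your order");
        List<String[]> table = extract(page);
        int columns = table.isEmpty() ? 0 : table.get(0).length;
        System.out.println("rows=" + table.size() + " cols=" + columns);
        for (String[] row : table) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

This only works when the extractor preserves column spacing, which is exactly the part I cannot rely on, so what I am looking for is a more robust way to get the same rows, columns, and cell values.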