Identify and extract table from pdf using java

Question

I have different types of PDFs that contain a mix of content such as text and tables. A table may appear anywhere on a page (top, middle, or bottom). I want to extract only the table data (number of columns, number of rows, and the cell contents) from these PDFs using Java, without passing the table's location.

What I have done so far:

1. I have used the iText Java API to read and extract the content. I used the following call:

PdfTextExtractor.getTextFromPage

but it only returns the data as plain text. I did not find any clue for identifying where a table exists in the PDF or how to extract data from it.

2. I have also tried the PDFBox Java API, but it didn't solve my problem either.

3. I have also followed this Stack Overflow link: PDF table extraction. But it does not give me the expected output. That algorithm expects line positions and similar details as input.

I am not able to identify where the table is located in the PDF.

Can anybody tell me how to solve this problem using the iText or PDFBox API, or is there any open-source API that can help me solve it?

Or can we convert the PDF to HTML, so that we can identify the table by its table tags and read it? ;)

have a look here: http://stackoverflow.com/a/38933039/535646 — Tilman Hausherr, Mar 31 '17 at 12:23

Rishu Shrivastava · Answer 1 · 2020-12-31T19:29:00.527

score 4

You can try Tabula, an open-source tool for detecting and extracting tables from PDF documents. You can build on tabula-java to extract the table details. More can be found here.

If you also need to extract plain text from the document, you can use PDFBox or Apache Tika for that.

edited Dec 31 '20 at 19:29

answered Jul 20 '19 at 13:08

The solution you have provided works perfectly for me; this is the best solution ever. – Ashish Agrawal Yodlee Dec 19 '19 at 07:59

score 0 · Answer 2 · answered Mar 31 '17 at 12:21

It basically depends on your input document, and how much effort you're willing to put into this project.

A PDF does not work like an HTML document. HTML documents have logical tags like "table" or "paragraph". A PDF document (in the most basic case) contains only the instructions needed to render it. So instead of getting "table", you might get "draw a line here, another one a bit further away, then another one that crosses both, and so on".

Also, according to the PDF specification, these instructions don't even have to appear in logical (reading) order.
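To illustrate why the ordering matters, here is a minimal sketch of rebuilding reading order from positioned text fragments. The `TextChunk` and `ReadingOrder` classes are hypothetical helpers made up for this example, not part of iText or PDFBox; in practice the coordinates would come from whatever PDF parser you use. It assumes a simple single-column page with horizontal text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one text fragment and where it sits on the page. */
final class TextChunk {
    final String text;
    final float x; // left edge of the fragment
    final float y; // baseline; in PDF user space y grows upward from the page bottom

    TextChunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrder {
    /**
     * Groups chunks into lines (top to bottom) and orders each line left to right.
     * A chunk joins the current line if its baseline is within yTolerance of the
     * baseline of the first chunk on that line.
     */
    static List<String> toLines(List<TextChunk> chunks, float yTolerance) {
        // Sort from the top of the page (largest y) to the bottom.
        List<TextChunk> sorted = new ArrayList<TextChunk>(chunks);
        sorted.sort(Comparator.comparingDouble((TextChunk c) -> c.y).reversed());

        List<List<TextChunk>> lines = new ArrayList<List<TextChunk>>();
        float lineY = 0f;
        for (TextChunk c : sorted) {
            if (lines.isEmpty() || Math.abs(c.y - lineY) > yTolerance) {
                lines.add(new ArrayList<TextChunk>());
                lineY = c.y;
            }
            lines.get(lines.size() - 1).add(c);
        }

        List<String> result = new ArrayList<String>();
        for (List<TextChunk> line : lines) {
            // Within one line, read left to right.
            line.sort(Comparator.comparingDouble((TextChunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (TextChunk c : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Calling `ReadingOrder.toLines(chunks, 2f)` on fragments that arrive in arbitrary order returns one string per visual line. Multi-column layouts and rotated text would need more than this.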

If you are lucky, your input PDF might be a tagged PDF. Tagged PDFs contain an internal representation of the document's underlying structure, so a tagged PDF might be able to tell you exactly which objects make up the table.

Now, to get back to an actual answer. If you want a solution that always works, you can implement the iText 7 IEventListener interface. It has a method eventOccurred() that gets called every time the parser has finished dealing with an object (like a piece of text, a line, etc.).

If you then look out for lines, and build some heuristic to determine when a collection of lines constitutes a table, you should be able to detect tables.
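As a rough illustration of such a heuristic, the sketch below guesses the row and column count of a ruled table from a list of line segments. `Segment`, `GridGuess` and `TableHeuristic` are hypothetical names invented for this example; the segments are assumed to have already been collected from the parser's path events. It is deliberately naive: it assumes one axis-aligned, fully ruled table per page and does not check that the lines actually intersect.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical line segment collected from the page's drawing instructions. */
final class Segment {
    final float x1, y1, x2, y2;

    Segment(float x1, float y1, float x2, float y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }
}

/** Result of the guess: the number of rows and columns in the ruled grid. */
final class GridGuess {
    final int rows;
    final int columns;

    GridGuess(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
    }
}

final class TableHeuristic {
    /**
     * Returns a guess at the grid size, or null if there are fewer than two
     * distinct horizontal or vertical rulings. Positions closer together than
     * tol are treated as the same ruling.
     */
    static GridGuess guessGrid(List<Segment> segments, float tol) {
        List<Float> horizontalYs = new ArrayList<Float>();
        List<Float> verticalXs = new ArrayList<Float>();
        for (Segment s : segments) {
            if (Math.abs(s.y1 - s.y2) <= tol) {
                horizontalYs.add(s.y1);      // roughly horizontal ruling
            } else if (Math.abs(s.x1 - s.x2) <= tol) {
                verticalXs.add(s.x1);        // roughly vertical ruling
            }
            // Diagonal segments are ignored.
        }
        int h = countDistinct(horizontalYs, tol);
        int v = countDistinct(verticalXs, tol);
        if (h < 2 || v < 2) {
            return null;
        }
        // N parallel rulings enclose N - 1 rows (or columns).
        return new GridGuess(h - 1, v - 1);
    }

    /** Counts clusters of values, where neighbours within tol share a cluster. */
    private static int countDistinct(List<Float> values, float tol) {
        List<Float> sorted = new ArrayList<Float>(values);
        Collections.sort(sorted);
        int count = 0;
        float last = 0f;
        for (float value : sorted) {
            if (count == 0 || value - last > tol) {
                count++;
            }
            last = value;
        }
        return count;
    }
}
```

Once the rulings are known, the text fragments reported by the parser can be assigned to cells by comparing their coordinates with the ruling positions. Tables drawn without ruling lines would need a different heuristic, for example one based on text alignment.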

iText also plans to release a pdf2Data add-on, which will basically do the heavy lifting for you.

I think iText 7 is not open source. – Gourav Saklecha Mar 31 '17 at 14:16