Reading tables and images from PDF using any NLP tools

Question

In one of my NLP assignments I have to read PDF files and extract information from them. Using Java, I can read the textual content of a PDF and apply our NLP algorithms to it, but I also need to extract the information held in tables. I am trying to read them but cannot get them out in a proper format. Any idea how I can read tables from a PDF document, or is there a library in OpenNLP, GATE, or Stanford NLP that can do this?

score 2 · Answer 1 · answered May 26 '16 at 15:21

Unfortunately, tables as structures are not stored in PDFs. You have to apply some serious coordinate math to figure out/estimate where a table is, where the columns are and where the rows are.
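To give a rough idea of what that coordinate math looks like, here is a minimal sketch in plain Java. It assumes you have already obtained each text fragment together with its x/y position from your PDF library. The `TableSketch` class, the `Word` type and the `toRows` method are hypothetical names made up for this illustration, not part of any library, and the fixed y tolerance is a simplification that real documents will often break.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TableSketch {

    /** A text fragment and the position of its top-left corner (hypothetical type). */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups words into rows: a word joins the current row while its y value
     * lies within yTolerance of the first word in that row. Each row is then
     * ordered left to right by x.
     */
    static List<List<String>> toRows(List<Word> words, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.y));

        List<List<Word>> rows = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word w : sorted) {
            // Start a new row when the vertical gap exceeds the tolerance.
            if (!current.isEmpty() && Math.abs(w.y - current.get(0).y) > yTolerance) {
                rows.add(current);
                current = new ArrayList<>();
            }
            current.add(w);
        }
        if (!current.isEmpty()) {
            rows.add(current);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Word> row : rows) {
            row.sort(Comparator.comparingDouble((Word w) -> w.x));
            List<String> cells = new ArrayList<>();
            for (Word w : row) {
                cells.add(w.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for positions read from a PDF.
        List<Word> words = List.of(
            new Word("Name", 50, 100), new Word("Qty", 200, 101),
            new Word("3", 200, 120), new Word("Apple", 50, 119.5));
        System.out.println(toRows(words, 3.0));
    }
}
```

This only recovers rows. Finding column boundaries, merged cells and multi-line cells needs the same kind of clustering on x plus a fair amount of heuristics, which is why a dedicated tool is usually the better starting point.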

For PDFs, Apache Tika doesn't have any special table handling (it does for MSWord, MSPPT and many other formats, but not PDFs).

To extract tables as tables from PDFs, you might consider tabulapdf; see also John Hewson's recommendation. There are also commercial tools that likely do a decent job with table extraction from PDFs -- ABBYY FineReader, Nuance *PDF products.

Thanks for the comments. I have already started evaluating tabulapdf and tweaking some of its code; I am able to get the table content, but not to the full extent. Will update as and when done. — NKS, May 27 '16 at 04:42
