
I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database.

Of course, I know I have to write something to do this. Perl is an option for me - or perhaps Java. I don't really care what language so long as it's free (or cheap with a free trial period to ensure it suits my purposes).

I'm looking at CAM::Parse (using Strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I guess I do have a preference for Perl, but really I want something that works dependably and is reasonably easy to do string manipulation with.

What is a good approach for something like this? I'm at square one, so if Java (or Python, etc.) has better hooks, now is a good time to know about it. General pointers are good; starter code would be strongly preferred.

elbillaf
  • Your description of PDFs possibly containing interesting information but possibly formatted differently indicates that you have no real idea what data you have. Before starting a PDF text-extraction project, please try to analyze the data you will have to process well enough to properly formulate your requirements. – mkl Jun 20 '13 at 22:15

1 Answer

  1. The PDF format, from its inception (more than 20 years ago), was never intended to host extractable, meaningfully structured data.

  2. Its purpose was to be a reliable visual representation of text, images and diagrams in a document -- a kind of digital paper (that would also reliably transfer to real paper via printing). Only later in its development were features added that help with extracting data again (search for Tagged PDF).

  3. For some examples of the problems posed when scraping tabular data from PDFs, see this article:

  4. Contradicting my point 1 above: for an amazing family of tools that gets better from week to week at extracting tabular data from PDFs (unless they are scanned pages), see these links:

So: go look for Tabula. If any tool can do what you want, Tabula is at this time probably among the best for the job!
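Since the question asks for starter code: below is a minimal sketch of the batch workflow using the tabula-py wrapper (a Python wrapper around tabula-java) together with SQLite. The directory layout, the "tables" destination name, and the assumption that the extracted tables share a compatible column layout are all hypothetical; real-world PDFs will need per-layout cleanup.

```python
# A minimal sketch, assuming tabula-py (pip install tabula-py; it needs a
# Java runtime because it shells out to tabula-java) and a folder of PDFs.
# Paths and the "tables" destination name are hypothetical.
import glob
import sqlite3

import tabula

conn = sqlite3.connect("extracted.db")

for path in glob.glob("pdfs/*.pdf"):
    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf(path, pages="all", multiple_tables=True)
    for df in tables:
        df["source_file"] = path  # keep provenance for later filtering
        # Appending assumes compatible columns across tables; in practice
        # you will likely normalize each table's layout first.
        df.to_sql("tables", conn, if_exists="append", index=False)

conn.close()
```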


Update

I've recently created an asciinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV:

[asciicast: the Tabula CLI extracting a big table from a PDF as CSV, hosted on asciinema.org]
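For reference, the CLI invocation shown in the screencast can also be scripted. Here is a minimal sketch driving the tabula-java jar from Python; the jar version, file names, and page range are placeholders:

```python
# A minimal sketch of calling the tabula-java command line tool (the tool
# demonstrated in the screencast) from Python. Jar name and paths are
# placeholders.
import subprocess

subprocess.run(
    [
        "java", "-jar", "tabula-1.0.5-jar-with-dependencies.jar",
        "--pages", "all",              # scan every page for tables
        "--format", "CSV",             # TSV and JSON are also supported
        "--outfile", "big-table.csv",  # where the extracted table lands
        "input.pdf",
    ],
    check=True,  # raise if extraction fails
)
```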

Kurt Pfeifle
  • The library advised above is deprecated. For people with this kind of need, you should use this new library: https://github.com/tabulapdf/tabula-java – Jérôme B Mar 09 '18 at 12:52
  • It only works on text-based PDFs, not on images. Is there anything similar that can extract data from PDF images? – Sundeep Pidugu Nov 30 '18 at 06:06
  • @Sundeep: ***Of course*** it can only work on text-based PDFs. If you want to extract tables from an image, you have to first run OCR (optical character recognition) on the image and then apply the table extraction to the resulting text (a sketch of that pipeline follows these comments). Final result quality will largely depend on the success of the OCR step. – Kurt Pfeifle Nov 30 '18 at 08:55
  • I am looking for tools that can do that, by the way. Thanks for the info, @KurtPfeifle – Sundeep Pidugu Nov 30 '18 at 09:07
  • @Sundeep: You could start by looking at which tools are mentioned here: https://stackoverflow.com/questions/tagged/ocr – Kurt Pfeifle Nov 30 '18 at 14:23
  • Which is the best solution now (2021)? – Pedro77 Mar 16 '21 at 02:09
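Following up on the OCR thread above: here is a minimal sketch of the two-step pipeline Kurt describes, assuming the ocrmypdf package (which drives Tesseract) alongside tabula-py. File names are placeholders, and, as noted in the comments, final quality depends almost entirely on the OCR step.

```python
# A minimal sketch of OCR-then-extract for scanned PDFs, assuming ocrmypdf
# (pip install ocrmypdf; requires a Tesseract install) plus tabula-py.
# File names are placeholders.
import ocrmypdf
import tabula

# Step 1: add a searchable text layer to the scanned pages via Tesseract.
ocrmypdf.ocr("scanned.pdf", "searchable.pdf")

# Step 2: run Tabula's table detection on the now text-based PDF.
tables = tabula.read_pdf("searchable.pdf", pages="all", multiple_tables=True)
print(f"extracted {len(tables)} table(s)")
```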