12

Is there any consistent way to extract tables from PDF files? Any tools?

What I have done so far:

  • I have tried the pdftotext tool. It has an option to convert to an HTML layout.

What is the problem with this:

  • The table information is not preserved in the HTML output.
  • I expected <table> tags, but everything was wrapped in <p> tags.

Does a PDF document contain any markers that indicate table structures, like <table>, <tr> and <td> in HTML?

If "yes", any pointers would be helpful. If "no", a definite statement to that effect would also be helpful.

Kurt Pfeifle
Rajneesh
  • 5
    @GeorgStocker: It is a bit ridiculous to close this question while giving as the reason that the OP should *"describe the problem and what has been done so far to solve it"*. -- The OP clearly said he/she had tried to use `pdftotext` and `pdftohtml`. He described the problem as *"expected table tags but everything was under p tag"*. – Kurt Pfeifle Jan 13 '15 at 19:40
  • 1
    Since my comment I've edited the OP a little bit in order to emphasize better what is being asked. – Kurt Pfeifle Jan 13 '15 at 20:29
  • Duplicate of https://stackoverflow.com/q/59338147/562769 – Martin Thoma Feb 11 '23 at 09:28

2 Answers

21

What you could do, however, is run pdftotext -layout input.pdf output.txt. It writes the PDF's text to a plain text file while keeping the original layout. There are no tags, but with a bit of scripting (Perl, PHP, whatever) you can recover the data from the tables.

If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on hundreds or thousands of pages, it's about the best you can get. I've been looking around for a long time and can't find any better PDF-to-text tool than pdftotext.

The output is somewhat inconsistent: not all similar PDF tables produce similar-looking text output, which makes the scripting a little more interesting.
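The "bit of scripting" mentioned above could look something like the following Python sketch. It assumes the text came from pdftotext -layout and that columns are separated by runs of at least two spaces, which is only a heuristic. The function name `split_layout_rows`, the `min_gap` parameter, and the sample text are made up for illustration and are not part of any tool.

```python
import re


def split_layout_rows(text, min_gap=2):
    """Split each non-empty line into cells at runs of >= min_gap spaces.

    Assumption: a single space separates words inside a cell, while two or
    more spaces separate columns. Empty cells are not detected, so rows with
    missing values will have fewer cells than the header row.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between table rows
        rows.append(gap.split(stripped))
    return rows


# Hypothetical sample resembling "pdftotext -layout" output.
sample = (
    "Item          Qty     Price\n"
    "Widget A        3     12.50\n"
    "\n"
    "Widget B       10      4.00\n"
)

rows = split_layout_rows(sample)
for row in rows:
    print(row)
```

Because of the inconsistency between documents, a real script would probably need per-layout tweaks (for example a different `min_gap`, or splitting at fixed column offsets taken from the header line).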

user281681
  • Thanks for the reply. I am currently using pdftotext itself, but I have to extract information from PDFs that follow a variety of layouts (single column, two column). I don't think I can write a script that can be applied to all the different formats I get. But pdfgenie http://www.pdftron.com/pdfgenie/ is doing a great job. It is extracting tables properly. – Rajneesh Jun 09 '14 at 09:14
  • But pdfgenie is paid though :( – Rajneesh Jun 09 '14 at 09:14
  • 3
    pdftotext with -layout option helped a lot. Thanks. – dlink Oct 20 '14 at 20:39
  • 1
    I find that the -table option works a lot better still – Vic Seedoubleyew Nov 29 '15 at 16:44
  • 1
    @VicSeedoubleyew: I cannot find the -table switch. http://manpages.ubuntu.com/manpages/lucid/man1/pdftotext.1.html doesn't mention one. pdftotext is version 0.24.5 – Quamis Jan 22 '16 at 10:10
  • @quamis Yes, I know, I struggled too. But it is there. Try it and it should work. I don't know why it isn't in the docs – Vic Seedoubleyew Jan 22 '16 at 22:23
  • @VicSeedoubleyew: `pdftotext -table Specification.pdf Syntax Warning: May not be a PDF file (continuing anyway) Syntax Error: Couldn't find trailer dictionary Syntax Error: Couldn't read xref table` No output txt file generated. Am I missing something? – Quamis Jan 25 '16 at 14:27
  • # pdftotext -table test.pdf I/O Error: Couldn't open file '-table': No such file or directory. (so same problem here) @VicSeedoubleyew where did you get your pdftotext binary? compiled it yourself with special options? – user281681 Jan 25 '16 at 17:54
  • 2
    @Quamis, I am using version 3.04 of pdftotext, downloaded with the xpdf package from http://www.foolabs.com/xpdf/download.html. And it actually prints out "-table" from the usage message. Hope this helps – Vic Seedoubleyew Jan 29 '16 at 20:13
  • @VicSeedoubleyew thanks, I'm using the one provided in the Ubuntu repositories, and it's 0.24.5. I'm not sure how that relates to xpdf though... Anyway, I'm now sure I'm using a very old version. – Quamis Feb 02 '16 at 09:29
  • 2
    @Quamis: Version 0.24.5 is from the "Poppler" fork of the initial XPDF code base. The fork happened in 2005. Nowadays the Poppler tools in general have more features than the original (which also continued to develop), and seem to be better maintained. However, the `-table` parameter of `pdftotext` seems to be one of the features the Poppler fork still lacks, and where the original XPDF is superior. The most recent release of XPDF was v3.04 on May 28, 2014. The most recent release of Poppler was v0.43.0 on April 28, 2016 (3 days ago). On May 28, 2014 Poppler was at v0.26.0 (17 releases since). – Kurt Pfeifle May 01 '16 at 16:53
13

If the PDF document lacks the information that marks content as a table, row, cell, etc. (known as tags), then there is no consistent way to extract tables from it. Most PDF documents do not contain these tags. Tags typically serve to make a PDF accessible, for example so that it can be read aloud, and they are not required for a PDF to be valid.
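As a rough way to see whether a given file carries such tags at all, one could scan its raw bytes for the /StructTreeRoot key, which a tagged PDF places in its document catalog. The sketch below is only a heuristic under that assumption: it can miss tagged files whose catalog is stored in a compressed object stream, and a hit does not guarantee that tables in particular are tagged. The function name `looks_tagged` and the byte samples are invented for illustration.

```python
def looks_tagged(pdf_bytes):
    """Heuristic: True if the raw bytes mention a structure tree root.

    Assumption: the document catalog is stored uncompressed. A proper check
    would use a PDF parser to read the catalog instead of scanning bytes.
    """
    return b"/StructTreeRoot" in pdf_bytes


# Hypothetical, truncated catalog objects (not complete PDF files).
untagged_sample = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
)
tagged_sample = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R "
    b"/StructTreeRoot 5 0 R /MarkInfo << /Marked true >> >>\nendobj\n"
)

print(looks_tagged(untagged_sample))
print(looks_tagged(tagged_sample))
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.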

Frank Rem
  • 9
    **+ 1** -- Good answer, basically. My own answer would have been the same until a few months ago. But then I discovered **[TabulaPDF](https://github.com/tabulapdf/tabula)** and its **[technology](http://tabula.technology/)**. -- Could you please vote to re-open this question, so I can add my answer? – Kurt Pfeifle Jan 13 '15 at 19:59
  • Something else you might try, when pdftotext and Acrobat both fail at extracting tables: open it in Word. I've used `Word 2013`'s built-in conversion with good results. Formatting within cells can be messy, but having the cells collated properly can be a good start. – Jonathan Lidbeck Jun 07 '17 at 17:13