
I am trying to extract a table from a PDF document (example). It's not a scan or an image, so please focus on non-OCR solutions; OCR table extraction is covered in a separate question.

I tried the route of PDF -> HTML -> extract table. The PDF mentioned above produces garbage when converted to HTML, maybe because of the font; the document is not in English.

Extracting text by x and y coordinates is not an option, as the solution needs to work for future PDFs from the URL mentioned above, which will contain the table but not always in the same position.

Martin Thoma
meadhikari
  • The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code, which is quite a task. – mkl Jul 11 '13 at 10:56
  • @mkl so in short, if it's not a do-or-die situation, I am better off not thinking about parsing this PDF? :) – meadhikari Jul 11 '13 at 11:03
  • I did something like this once using [PDFMiner](https://pypi.python.org/pypi/pdfminer/). You can basically get a stream of all the objects along with their x and y positions, then group them top-to-bottom, left-to-right (for English at least), then make some intelligent guesses about where cells end based on your knowledge of the context. It's painful and every PDF is different. If you don't have to parse it, don't. How frequently is this published? – ChrisP Jul 11 '13 at 12:15
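The grouping step ChrisP describes can be sketched roughly as follows. This is pure Python with made-up `(x, y, text)` fragments standing in for what PDFMiner's layout analysis would emit; the `y_tolerance` value is an arbitrary assumption you would tune per document:

```python
# Sketch of the grouping ChrisP describes: cluster text fragments into
# rows by y coordinate, then order each row left-to-right by x.
# The fragments below are made-up stand-ins, not real PDFMiner output.

def group_into_rows(fragments, y_tolerance=2.0):
    """fragments: list of (x, y, text); returns rows top-to-bottom."""
    rows = []
    # PDF y coordinates grow upward, so sort descending for top-to-bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # same row, within tolerance
        else:
            rows.append((y, [(x, text)]))  # start a new row
    # Within each row, order cells left-to-right by x.
    return [[t for _, t in sorted(cells)] for _, cells in rows]

fragments = [
    (10, 700, "Name"), (120, 700, "Amount"),
    (10, 680, "Foo"),  (120, 681, "42"),
]
print(group_into_rows(fragments))
# [['Name', 'Amount'], ['Foo', '42']]
```

The hard part (guessing column boundaries when cells are empty or text wraps) is exactly the "intelligent guesses" ChrisP mentions and is not handled here.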

4 Answers


The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code which is quite a task.

Generally speaking, if you are sure enough future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy to follow hints to recognize the contents of individual fields.

Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.

This is due to the fact that all fonts in the document claim that they use WinAnsiEncoding even though the characters referenced this way definitively are not from the WinAnsi character selection.

Thus reliable text extraction from your document without OCR is impossible after all!

(Trying copy&paste from Adobe Reader generally is a good first test whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many many years and, therefore, have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)
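A programmatic version of that copy&paste sanity check is easy to sketch: feed in whatever your extractor produced and see whether it looks like natural text or like the semi-random WinAnsi garbage described above. The heuristic and the 0.8 threshold below are my own arbitrary assumptions, not a standard test:

```python
# Crude proxy for "text extraction worked": check what fraction of the
# extracted characters are letters, digits, whitespace, or common
# punctuation. Mojibake and control characters drag the ratio down.

def looks_like_text(s, threshold=0.8):
    """Return True if the string plausibly came from a working extractor."""
    if not s:
        return False
    ok = sum(c.isalnum() or c.isspace() or c in ".,;:!?-()'\"" for c in s)
    return ok / len(s) >= threshold

print(looks_like_text("Total amount: 1,234.56"))  # True
print(looks_like_text("\x01\x02\x03\x7f¤¤¤"))     # False
```

If this fails on the output of every library you try, you are likely in the situation described above, and OCR is the only remaining route.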

mkl
  • Can you point me to some direction if I want to go through the OCR route? – meadhikari Jul 12 '13 at 11:33
  • Unfortunately no, I've not yet had to resort to OCR myself. – mkl Jul 12 '13 at 12:14
  • I'm trying to tackle this as well. Interesting thing I came across: I parse a PDF that clearly looks like it's been generated from html/word document to pdf. When I export it from Acrobat Pro to Word document the table formatting is 100% correct in the output ``.docx`` file. My question is that if the formatting is not there, how does Acrobat make a perfect extraction of the table? – amergin Jan 11 '15 at 04:21
  • @amergin **a** the sample file presented by the original poster did not contain the required information for direct text extraction, but your file may well contain it... **b** Acrobat has an OCR module and so could apply OCR if necessary... **c** how exactly Acrobat extracts structure information is not clear. Perhaps your PDF contains additional tags, perhaps Acrobat knows how the program which generated your PDF renders tables, perhaps it applies generic artificial intelligence to recognize tables... – mkl Jan 11 '15 at 14:41

You could use Tabula: http://tabula.nerdpower.org. It's free and fairly easy to use.

panchtox
  • Have you tried [Tabula](http://source.opennews.org/en-US/articles/introducing-tabula/) on the [document](http://www.nea.org.np/images/supportive_docs/55082070-3-19.pdf) provided by the OP? As I mentioned in my answer the document does *not contain the required information for direct text extraction*, i.e. text extraction using information encoded in the PDF syntax, and Tabula relies on PDFBox for text extraction which only uses such information. Thus, I doubt Tabula will help here now. – mkl Dec 28 '13 at 21:12
  • After your comment, I've used Tabula to extract the first table's information as CSV. It seems to be working, although the text is changed (due to encoding, I think). Nevertheless, I don't think I have the technical knowledge to give a more advanced answer. – panchtox Dec 30 '13 at 02:55
  • Well, the text is most likely changed because the document is missing the information for straightforward text extraction, and the assumptions made in place of that information are likely false. – mkl Dec 30 '13 at 09:05
  • @franaf: Yesssss! Tabula is getting better and better every week... :-) – Kurt Pfeifle Sep 29 '14 at 23:46

Extracting tables from PDF documents is extremely hard as PDF does not contain a semantic layer.

Camelot

You can try camelot, maybe even in combination with its web interface excalibur:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!


Tabula

tabula can be installed via

pip install tabula-py

But it requires Java, as tabula-py is only a wrapper around the Java project.

It's used like this:

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')


AWS Textract

I haven't tried it recently, but AWS Textract claims:

Amazon Textract can extract tables in a document, and extract cells, merged cells, and column headers within a table.
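Textract returns its results as a flat list of `Blocks`; table cells are `CELL` blocks carrying `RowIndex` and `ColumnIndex`. Turning that into a grid can be sketched as below. The `response` dict is a heavily simplified, made-up stand-in: a real response nests the cell text in `WORD` blocks reached via `Relationships` rather than a `Text` field on the cell itself.

```python
# Sketch: rebuild a row/column grid from Textract-style CELL blocks.
# The response structure here is a simplified stand-in for the real
# AnalyzeDocument output (FeatureTypes=["TABLES"]).

def cells_to_grid(blocks):
    cells = [b for b in blocks if b["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Textract indices are 1-based.
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = c.get("Text", "")
    return grid

response = {"Blocks": [
    {"BlockType": "TABLE"},
    {"BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1, "Text": "Name"},
    {"BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2, "Text": "Qty"},
    {"BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1, "Text": "Foo"},
    {"BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2, "Text": "3"},
]}
print(cells_to_grid(response["Blocks"]))
# [['Name', 'Qty'], ['Foo', '3']]
```

Note that Textract is OCR-based, so unlike the libraries above it would also work on the OP's document, where the PDF-syntax route is blocked by the broken encoding.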

PdfPlumber

pdfplumber's table extraction methods:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # list of rows, each a list of cell strings


Martin Thoma

One option is to use pdf-table-extract: https://github.com/ashima/pdf-table-extract.

amergin