6

What is the best way to extract tables that are embedded in PDF documents?

I am not interested in solutions which work only for JRuby, or which make use of third-party APIs or websites.

Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

Tilo
  • Extracting data organized in structured layouts from PDF files is much harder than you might anticipate and it is unlikely that you will be able to get a very reliable solution that works on arbitrary PDF files. – lorefnon Jan 28 '17 at 19:19
  • I have PDF files which get generated by one company, so I was hoping that they don't change their PDF generation tool, and always use the same format – Tilo Jan 28 '17 at 23:19
  • What are you trying to do with the tables? Do you want them in html format? text format? – BigRon Feb 03 '17 at 04:59
  • I want to extract the text from the tables and get the data as strings for each column/row – Tilo Feb 06 '17 at 19:45
  • Both @ZachTuttle's [answer](http://stackoverflow.com/a/41963389/2628223) and [mine](http://stackoverflow.com/a/42017186/2628223) would solve this problem. – BigRon Feb 07 '17 at 20:22

4 Answers

3

You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.

theterminalguy
  • Thank you for your answer, but unfortunately I can not use a third-party API / site to process the PDFs, because they contain confidential data – Tilo Jan 29 '17 at 19:15
  • @Tilo: AFAIK, this gem extracts the table locally, without needing any third-party to process the PDF. Sure, you need to trust the code, but you can audit it before launching it. – Eric Duminil Feb 02 '17 at 16:53
2

Check out the pdf-reader gem; I think it's what you're looking for.
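As a rough sketch of how this could look: pdf-reader exposes `PDF::Reader.new(path).pages`, and each page responds to `text`. The gem only returns page text, not tables, so the `table_lines` helper below is a hypothetical addition that assumes the table starts right after a recognizable header line and ends at the first blank line. The header pattern and file name are placeholders.

```ruby
# Returns the text of every page using the pdf-reader gem
# (install with: gem install pdf-reader).
def page_texts(pdf_path)
  require 'pdf-reader' # loaded lazily so table_lines can be used without the gem
  PDF::Reader.new(pdf_path).pages.map(&:text)
end

# Hypothetical helper: returns the lines that follow the first line matching
# header_pattern, stopping at the first blank line. Returns [] if no line matches.
def table_lines(page_text, header_pattern)
  lines = page_text.lines.map(&:rstrip)
  start = lines.index { |line| line.match?(header_pattern) }
  return [] unless start

  lines.drop(start + 1).take_while { |line| !line.strip.empty? }
end

# Example usage (assumes 'report.pdf' exists and its table header reads "Name  Qty"):
#   page_texts('report.pdf').each do |text|
#     puts table_lines(text, /Name\s+Qty/)
#   end
```

This only isolates the table's lines; splitting each line into columns still depends on how the PDF lays out its text.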

Zach Tuttle
1

You can extract data from a pdf with poppler. Depending on your exact requirements, this may be sufficient.

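require 'shellwords'

# Runs poppler's pdftotext; writes a .txt file next to the PDF.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Runs poppler's pdftohtml; writes HTML output next to the PDF.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end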

These commands will extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.

You can install poppler on a Mac with Homebrew:

brew install poppler
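
To get from the extracted text to strings per row and column, one possible approach is to run `pdftotext` with its `-layout` option (which tries to preserve the physical layout) and `-` as the output file (which writes to stdout), then split each line on runs of spaces. This is a minimal sketch, not a general solution: the helper names are hypothetical, and it assumes columns are always separated by at least two spaces and that no cell is empty.

```ruby
require 'open3'

# Runs `pdftotext -layout <pdf> -` and returns the text written to stdout.
# Passing the arguments separately avoids the need for shell escaping.
def pdf_layout_text(pdf_path)
  stdout, stderr, status = Open3.capture3('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed: #{stderr}" unless status.success?

  stdout
end

# Splits every non-blank line into cells wherever the gap pattern matches.
# Assumption: two or more whitespace characters separate adjacent columns.
def rows_from_layout_text(text, gap: /\s{2,}/)
  text.each_line
      .map(&:strip)
      .reject(&:empty?)
      .map { |line| line.split(gap) }
end

# Example usage (assumes 'report.pdf' exists and poppler is installed):
#   rows = rows_from_layout_text(pdf_layout_text('report.pdf'))
#   rows.each { |cells| p cells }
```

Because the PDFs in question come from a single generator, a fixed splitting rule like this may hold up; empty cells or cells containing double spaces would need column positions instead of a gap pattern.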
BigRon
  • this might be an interesting link https://github.com/ashima/pdf-table-extract to do the actual table extraction – Tilo Feb 07 '17 at 21:31
  • @Tilo yes that looks promising. It's in Python but you could mimic the logic. – BigRon Feb 08 '17 at 02:19
0

There is a gem called Iguvium that does exactly this. Here is an example:

require 'csv'     # provides Array#to_csv
require 'iguvium'

pages = Iguvium.read('filename.pdf')
tables = pages.first.extract_tables!
csv = tables.first.to_a.map(&:to_csv).join
Weston Ganger