Extracting Tables from PDF files in Ruby

Question

What is the best way to extract tables that are embedded in PDF documents?

I am not interested in solutions that work only for JRuby, or that rely on third-party APIs or websites.

Can you share some Ruby code on how to extract the table(s)? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

Extracting data organized in structured layouts from PDF files is much harder than you might anticipate and it is unlikely that you will be able to get a very reliable solution that works on arbitrary PDF files. — lorefnon, Jan 28 '17 at 19:19
I have PDF files which get generated by one company, so I was hoping that they don't change their PDF generation tool, and always use the same format — Tilo, Jan 28 '17 at 23:19
What are you trying to do with the tables? Do you want them in html format? text format? — BigRon, Feb 03 '17 at 04:59
I want to extract the text from the tables and get the data as strings for each column/row — Tilo, Feb 06 '17 at 19:45
Both @ZachTuttle's [answer](http://stackoverflow.com/a/41963389/2628223) and [mine](http://stackoverflow.com/a/42017186/2628223) would solve this problem. — BigRon, Feb 07 '17 at 20:22

score 3 · Answer 1 · answered Jan 29 '17 at 19:03

You may want to take a look at this answer (How to convert PDF to Excel or CSV in Rails 4). It solves the same problem you are trying to solve.

theterminalguy

Thank you for your answer, but unfortunately I cannot use a third-party API or site to process the PDFs, because they contain confidential data – Tilo Jan 29 '17 at 19:15
@Tilo: AFAIK, this gem extracts the table locally, without needing any third-party to process the PDF. Sure, you need to trust the code, but you can audit it before launching it. – Eric Duminil Feb 02 '17 at 16:53

score 2 · Answer 2 · answered Jan 31 '17 at 17:06

Check out this gem; I think it's what you're looking for: pdf-reader.
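Note that pdf-reader returns page text rather than tables, so rows and columns have to be rebuilt from that text. The following is only a minimal sketch of one way to do that, assuming the gem is installed and that columns in the extracted text end up separated by at least two spaces (worth checking against your own files). `page_texts` and `rows_from_text` are hypothetical helper names, not part of the gem.

```ruby
# Returns the extracted text of each page, one string per page.
# Assumes the pdf-reader gem is installed; it is required lazily here
# so that rows_from_text below can be tried without the gem.
def page_texts(pdf_path)
  require 'pdf/reader'
  PDF::Reader.new(pdf_path).pages.map(&:text)
end

# Hypothetical helper: splits each line on runs of two or more whitespace
# characters and keeps only lines that yield at least `min_cells` cells.
def rows_from_text(text, min_cells: 2)
  text.each_line
      .map { |line| line.strip.split(/\s{2,}/) }
      .select { |cells| cells.size >= min_cells }
end
```

With those helpers, `page_texts('report.pdf').flat_map { |t| rows_from_text(t) }` would give one array of strings per candidate row. Lines that are not part of a table but happen to contain wide gaps will also be picked up, so some filtering is likely needed.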

Zach Tuttle

this looks good, but I don't see any special handling of tables – Tilo Feb 07 '17 at 21:30

score 1 · Accepted Answer · answered Feb 03 '17 at 05:10

You can extract data from a pdf with poppler. Depending on your exact requirements, this may be sufficient.

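require 'shellwords'

# Runs poppler's pdftotext; the .txt file is written next to the PDF.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Runs poppler's pdftohtml; the HTML output is written next to the PDF.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end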

These commands will extract the PDF to a text file and an HTML file, respectively, saved in the same location as your PDF.

You can install poppler on a Mac with Homebrew:

brew install poppler
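To get from the plain text to per-row, per-column strings, one option is to run pdftotext with its `-layout` flag, which keeps the physical column alignment, and then cut each line at the character positions that are blank in every row. This is only a rough sketch under that assumption; `layout_text` and `split_fixed_width` are hypothetical helper names, and you would pass in only the lines that belong to the table.

```ruby
require 'open3'

# Runs pdftotext with -layout and returns the text.
# The trailing "-" sends the output to stdout instead of a .txt file.
def layout_text(pdf_path)
  out, err, status = Open3.capture3('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed: #{err}" unless status.success?
  out
end

# Hypothetical helper: finds character positions that are blank in every
# line ("gutters") and slices each line into cells between them.
def split_fixed_width(lines)
  width = lines.map(&:length).max || 0
  padded = lines.map { |line| line.ljust(width) }
  gutter = (0...width).map { |i| padded.all? { |line| line[i] == ' ' } }

  # Collect the contiguous non-gutter runs; each run is one column.
  ranges = []
  start = nil
  gutter.each_with_index do |is_gutter, i|
    if !is_gutter && start.nil?
      start = i
    elsif is_gutter && start
      ranges << (start...i)
      start = nil
    end
  end
  ranges << (start...width) if start

  padded.map { |line| ranges.map { |range| line[range].strip } }
end
```

For example, `split_fixed_width(layout_text('report.pdf').lines.map(&:chomp).reject { |l| l.strip.empty? })` would return an array of rows, each an array of cell strings, with empty cells preserved as `""`. Text outside the table (titles, footers) will usually break the gutter detection, and with only a few rows a space inside a cell can be mistaken for a column boundary.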

BigRon

this might be an interesting link https://github.com/ashima/pdf-table-extract to do the actual table extraction – Tilo Feb 07 '17 at 21:31
@Tilo yes that looks promising. It's in Python but you could mimic the logic. – BigRon Feb 08 '17 at 02:19

score 0 · Answer 4 · answered Apr 14 '20 at 14:43

There is a gem called Iguvium that does exactly this. Here is an example:

require 'csv'      # provides Array#to_csv
require 'iguvium'

pages = Iguvium.read('filename.pdf')
tables = pages.first.extract_tables!
csv = tables.first.to_a.map(&:to_csv).join

Weston Ganger
