
I have a PDF with many tables in it, and I'm trying to parse them into a more readable format using R. So far, I've tried two methods:

  1. using pdftools::pdf_text() to get the text, then basically using regexes to manually read in the tables (honestly wasn't as bad as it sounds; see the sketch after this list)
  2. using tabulizer::extract_tables(), which somehow magically does all the work for me (it's kinda slow but bearable)
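
A minimal sketch of both approaches, assuming a placeholder file "report.pdf" (the splitting regex in method 1 will depend on the actual table layout):

```r
library(pdftools)
library(tabulizer)

# Method 1: extract raw text, then split each line on runs of 2+ spaces
txt   <- pdf_text("report.pdf")              # character vector, one string per page
lines <- strsplit(txt[1], "\n")[[1]]         # lines of page 1
rows  <- strsplit(trimws(lines), "\\s{2,}")  # crude column split; layout-dependent

# Method 2: dedicated table extraction (needs Java via rJava)
tabs <- extract_tables("report.pdf")         # list of character matrices, one per table
df   <- as.data.frame(tabs[[1]], stringsAsFactors = FALSE)
```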

Both methods were surprisingly good, but still had some issues related to messing up the columns/alignment - sometimes columns were combined, sometimes headers were misaligned with the data columns, etc. I'm willing to sort of brute force wrangle the data, but before I try that I just want to see if there are smarter ways to do this.
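
For what it's worth, the brute-force wrangling I have in mind looks roughly like this (the column names are hypothetical; tidyr::separate() splits a merged column back apart):

```r
library(tidyr)

# Hypothetical clean-up: suppose extract_tables() merged two columns into
# a single column V1, with the two values separated by whitespace.
df_fixed <- separate(df, V1, into = c("name", "value"),
                     sep = "\\s+", extra = "merge", fill = "right")
```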

So, are there better ways to read in tables from PDFs?

  • I struggled with the same thing. I found it easier to convert my `pdf`s to `docx` and use the `officer` package. Namely `officer::docx_summary(officer::read_docx(x))`. `docx` is essentially an archive container with `xml` in it, which I've found easiest to work with. – Anonymous coward Jul 20 '18 at 22:05
  • Can you provide an example of a PDF for which you want to extract tables? – Emmanuel Hamel Sep 15 '22 at 15:59
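
A sketch of the docx route from the first comment. Note the PDF-to-DOCX conversion is not part of officer; this assumes LibreOffice is installed and its soffice binary is on the PATH:

```r
# Convert the PDF to DOCX with LibreOffice (assumption: soffice is on the PATH;
# this step is external to R and to the officer package)
system2("soffice", c("--headless", "--convert-to", "docx", "report.pdf"))

library(officer)
doc   <- read_docx("report.docx")
parts <- docx_summary(doc)                            # one row per document element
cells <- parts[parts$content_type == "table cell", ]  # keep only the table cells
```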

0 Answers