
I have a PDF with many tables in it, and I'm trying to parse them into a more readable format using R. So far, I've tried two methods:

  1. using pdftools::pdf_text() to get the text, then basically using regexes to manually read in the tables (honestly wasn't as bad as it sounds; see the sketch after this list)
  2. using tabulizer::extract_tables(), which somehow magically does all the work for me (it's kinda slow but bearable)
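
A minimal sketch of both approaches, assuming a placeholder file "report.pdf" (the splitting regex in method 1 will depend on the actual table layout):

```r
library(pdftools)
library(tabulizer)

# Method 1: extract raw text, then split each line on runs of 2+ spaces
txt   <- pdf_text("report.pdf")              # character vector, one string per page
lines <- strsplit(txt[1], "\n")[[1]]         # lines of page 1
rows  <- strsplit(trimws(lines), "\\s{2,}")  # crude column split; layout-dependent

# Method 2: dedicated table extraction (needs Java via rJava)
tabs <- extract_tables("report.pdf")         # list of character matrices, one per table
df   <- as.data.frame(tabs[[1]], stringsAsFactors = FALSE)
```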

Both methods were surprisingly good, but still had some issues related to messing up the columns/alignment - sometimes columns were combined, sometimes headers were misaligned with the data columns, etc. I'm willing to sort of brute force wrangle the data, but before I try that I just want to see if there are smarter ways to do this.
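
For what it's worth, the brute-force wrangling I have in mind looks roughly like this (the column names are hypothetical; tidyr::separate() splits a merged column back apart):

```r
library(tidyr)

# Hypothetical clean-up: suppose extract_tables() merged two columns into
# a single column V1, with the two values separated by whitespace.
df_fixed <- separate(df, V1, into = c("name", "value"),
                     sep = "\\s+", extra = "merge", fill = "right")
```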

So, are there better ways to read in tables from PDFs?

  • I struggled with the same thing. I found it easier to convert my `pdf`s to `docx` and use the `officer` package. Namely `officer::docx_summary(officer::read_docx(x))`. `docx` is essentially an archive container with `xml` in it, which I've found easiest to work with. – Anonymous coward Jul 20 '18 at 22:05
  • Can you provide an example of a PDF for which you want to extract tables? – Emmanuel Hamel Sep 15 '22 at 15:59
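
A sketch of the docx route from the first comment. Note the PDF-to-DOCX conversion is not part of officer; this assumes LibreOffice is installed and its soffice binary is on the PATH:

```r
# Convert the PDF to DOCX with LibreOffice (assumption: soffice is on the PATH;
# this step is external to R and to the officer package)
system2("soffice", c("--headless", "--convert-to", "docx", "report.pdf"))

library(officer)
doc   <- read_docx("report.docx")
parts <- docx_summary(doc)                            # one row per document element
cells <- parts[parts$content_type == "table cell", ]  # keep only the table cells
```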

0 Answers