I have a PDF with many tables in it, and I'm trying to parse them into a more readable format using R. So far, I've tried two methods:
- using `pdftools::pdf_text()` to get the text, then basically using regexes to manually read in the tables (honestly wasn't as bad as it sounds)
- using `tabulizer::extract_tables()`, which somehow magically does all the work for me (it's kinda slow but bearable)
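For concreteness, the first approach can be sketched roughly as below. This is a minimal illustration, not the exact code: the `page` string is made-up sample text standing in for one element of the character vector that `pdftools::pdf_text()` returns (one element per page), and treating runs of two or more spaces as column separators is an assumption about the layout.

```r
# Hypothetical page text standing in for one page of pdf_text() output;
# the table contents are invented for illustration.
page <- "Name       Qty    Price\nWidget A    10     2.50\nWidget B     3    11.00\n"

# Split the page into lines, trim them, and drop empty ones.
lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]

# Assumption: columns are separated by two or more spaces, so single
# spaces inside a cell (e.g. "Widget A") are preserved.
cells <- strsplit(lines, " {2,}")

# Keep only rows that have as many cells as the header row.
n_cols <- length(cells[[1]])
rows <- cells[lengths(cells) == n_cols]

# First row is the header, the remaining rows are data.
tbl <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
names(tbl) <- rows[[1]]

# Convert columns that look numeric; keep text columns as character.
tbl <- utils::type.convert(tbl, as.is = TRUE)
print(tbl)
```

A split like this merges two columns whenever they are separated by only a single space, and rows with empty cells get dropped by the length check, so it is fragile on irregular layouts.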
Both methods worked surprisingly well, but both still had problems with column alignment: sometimes columns were combined, sometimes headers were misaligned with the data columns, etc. I'm willing to brute-force wrangle the data, but before I try that I want to see if there are smarter ways to do this.
So, are there better ways to read in tables from PDFs?