I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act), and import it into R. I want to break it up into 2 parts. 1st the table of contents (pic 1). Second, the information in the act (pic 2). But I do not want the French part (je suis désolé). I have tried using tabulizer extract_area()
, but I don't want to have to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).
Obviously I don't have a minimal reproducible example coded out... But the pdf is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do using either pdftools
or tabulizer
, I'd prefer the answer using one of those libraries (mostly for learning purposes).
I've seen some similarish questions on stackoverflow, but they're all confusingly written/designed for tables, of which this is not. I am not a quant/data science researcher by training, so an explanation would be super helpful (but not required).