I'm trying to extract data from a Canadian Act (in this case, the Food and Drugs Act) for a project, and import it into R. I want to break it into two parts: first, the table of contents (pic 1), and second, the content of the act itself (pic 2). But I do not want the French half (sorry!). I have tried tabulizer's extract_area(), but I don't want to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).

Obviously I don't have a minimal reproducible example coded out... But the PDF is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
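
In case it helps anyone reproduce this, fetching the file into R is one line. The direct link below is an assumption based on the site's usual /PDF/<id>.pdf pattern; check the page above for the actual link:

# Assumed direct link -- verify against the page linked above.
download.file('https://laws-lois.justice.gc.ca/PDF/F-27.pdf',
              destfile = 'F-27.pdf', mode = 'wb')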

Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do with either pdftools or tabulizer, I'd prefer an answer using one of those libraries (mostly for learning purposes).
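
For what it's worth, the XML route would look roughly like this with the xml2 package. This is a sketch only: the file name and the Section element are assumptions about the site's schema, which I haven't checked:

library(xml2)
doc = read_xml('F-27.xml')               # XML version downloaded from the same site
secs = xml_find_all(doc, '//Section')    # element name is a guess at the schema
head(xml_text(secs))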

I've seen some similar-ish questions on Stack Overflow, but they're all confusingly written or designed for tables, which this is not. I am not a quant/data science researcher by training, so an explanation would be super helpful (but not required).

[pic 1: TOC]

[pic 2: Content of the Legislation]

  • You could add a French dictionary to your stopwords list - this means you'd still extract it but you'd remove all of the French during your data cleaning step. – bstrain Feb 27 '21 at 04:39
  • Maybe I'm misinterpreting your response, but using stopwords would be an ineffective way to do this. There are many words in English that are spelled identically in French, e.g. in the pictures above, "classification" is used in exactly the same way (a quick illustration follows these comments). – Alex Betsos Feb 27 '21 at 11:25
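
To make the comment above concrete, here's a quick illustration of the overlap problem; both word lists are invented for the example:

# Invented word lists, just to show the English/French overlap problem:
english_fragment = c("classification", "of", "food", "standards")
french_words = c("classification", "de", "aliments", "normes")

# Filtering on a French word list also deletes legitimate English tokens:
english_fragment[!english_fragment %in% french_words]
# "of" "food" "standards"   -- "classification" is lost despite being English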

1 Answer


Here's an option that reads in the PDF text and detects the language of each fragment. You're probably going to have to do a lot of text cleanup after reading in the PDF. I assume you don't care about retaining formatting.

library(pdftools)
a = pdf_text('F-27.pdf')  # one character string per page

# Split each page into lines. strsplit is vectorized, and the pattern
# covers both Windows (\r\n) and Unix (\n) line endings.
b = strsplit(a, '\r\n|\n')

# Do a bunch of other text cleanup; here's an example using the third page.
# You can expand this to cover all of b with a loop or a list function like lapply.
# Splitting on two spaces should retain most sentence-like fragments; you can get more sophisticated:
d = strsplit(b[[3]], '  ')[[1]]

library(cld3)  # language detector, distinguishes French from English
x = detect_language(d)  # vectorized: one language code (or NA) per fragment

# Keep only the English fragments. Note we subset d, not x, and use %in%
# so fragments cld3 can't classify (NA) are dropped rather than kept as NA.
d[x %in% 'en']
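
As the comments in the code suggest, you can wrap the per-page cleanup in a helper and run it over every page. A sketch (keep_english is a name I made up):

# Same steps as above, applied to every page in b:
keep_english = function(page_lines) {
  frags = trimws(unlist(strsplit(page_lines, '  ')))
  frags = frags[frags != '']
  frags[detect_language(frags) %in% 'en']
}
english_text = unlist(lapply(b, keep_english))
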
– rdodhia