I have a website that hosts several hundred PDFs. I need to iterate through them and download every PDF to my local machine. I would like to use rvest. My attempt:

library(rvest)

url <- "https://example.com"

scrape <- url %>% 
  read_html() %>% 
  html_node(".ms-vb2 a") %>%
  download.file(., 'my-local-directory')

How do I grab each PDF from its link? `download.file()` does not work, and I have no clue how to get each file. I just get this error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseEntityRef: no name [68]

papelr

1 Answer

library(rvest)
library(httr)  # for config()

url <- "https://example.com"
# The site's certificate fails verification, so skip SSL verification
page <- html_session(url, config(ssl_verifypeer = FALSE))

# Relative hrefs of the PDF links, plus the text used to build the file names
links   <- page %>% html_nodes(".ms-vb2 a") %>% html_attr("href")
subject <- page %>% html_nodes(".ms-vb2:nth-child(3)") %>% html_text()
name    <- page %>% html_nodes(".ms-vb2 a") %>% html_text()

for (i in seq_along(links)) {
  pdf_page <- html_session(URLencode(paste0("https://example.com", links[i])),
                           config(ssl_verifypeer = FALSE))
  # Write the raw response body to "<name>-<subject>.pdf" in the working directory
  writeBin(pdf_page$response$content, paste0(name[i], "-", subject[i], ".pdf"))
}

The URL is https and its certificate does not verify, so I had to use `config(ssl_verifypeer = FALSE)`.
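
For reference, here is a minimal sketch of the same fetch done with httr directly, streaming the response to disk instead of reading it through a session. The URL and the `links` vector are just the placeholders from above, and the same certificate workaround applies:

library(httr)

# Hypothetical single-file example: fetch the first PDF link and write it
# straight to disk, again skipping SSL verification.
resp <- GET(URLencode(paste0("https://example.com", links[1])),
            config(ssl_verifypeer = FALSE),
            write_disk("first.pdf", overwrite = TRUE))
stop_for_status(resp)  # error out if the request failed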

In `writeBin()` you can name the file according to your needs. Originally I had just named them `ok_1.pdf`, `ok_2.pdf`, and so on.
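
Since the file names are built from scraped text, it can help to strip out characters that are not legal in file names first. A small sketch of a helper you could drop in before the loop above (`safe()` is just an illustrative name, not part of the answer):

# Hypothetical helper: replace anything that is not a letter, digit,
# space, underscore, or hyphen, so the pasted file name is always valid.
safe <- function(x) gsub("[^A-Za-z0-9 _-]", "_", x)

# Inside the loop, build the file name from the sanitised pieces:
# writeBin(pdf_page$response$content,
#          paste0(safe(name[i]), "-", safe(subject[i]), ".pdf"))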

Bharath
  • If I wanted to name the files what they're named on the website, can I replace the `"ok_%d.pdf"` with something like `[links].pdf`? – papelr Dec 17 '18 at 19:21
  • Like for example, on the site there's a "Subject" line. Can I change the name of the saved file to the "Name" and "Subject" combined? Throw another `paste0` in there? Or is that a substantial lift? – papelr Dec 17 '18 at 19:45
  • I have edited the code now as per your needs. @papelr – Bharath Dec 17 '18 at 20:47