R code for downloading all the pdfs given on a site: Web scraping

Question

I want to write R code that downloads all the PDFs linked on this URL: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy and saves them in a folder. I tried the following code, adapted from https://towardsdatascience.com, but it errors out:

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx? 
head=Handbook%20of%20Statistics%20on%20Indian%20Economy") %>%

raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>%  # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://rbi.org.in", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.rbi.org.in", .) %>% # prepend the website again to get a full url
for (url in raw_list)
{ download.file(url, destfile = basename(url), mode = "wb") 
}

I can't work out why the code is erroring out. Can someone help me?

score 3 · Accepted Answer · edited May 07 '23 at 22:56

When I tried to run your code, I ran into "Verify that you are a human" and "Please ensure that your browser has Javascript enabled" dialogues. This suggests that the page cannot be fetched with rvest alone and that you need browser automation, such as RSelenium, instead.

Here is a modified version using RSelenium:

library(tidyverse)
library(stringr)
library(purrr)
library(rvest)
library(RSelenium)

# Start a Selenium-driven Firefox session
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]

# Load the page in a real browser, then hand the rendered HTML to rvest
remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")
page <- remDr$getPageSource()[[1]]
html <- read_html(page)

# Collect every link that contains ".PDF"
urls <- html %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.PDF")

# Use the last path component of each URL as the local file name
filenames <- basename(urls)

for (u in seq_along(urls)) {
  cat("downloading:", u, "of", length(urls), "\n")
  download.file(urls[u], filenames[u], mode = "wb")
  Sys.sleep(1)  # pause between requests
}
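
The loop above stops at the first download that throws an error. A minimal sketch of a more forgiving variant, assuming you want to skip files that already exist and carry on after a failure. `download_all` and its `downloader` and `pause` arguments are hypothetical names introduced here for illustration; they are not part of this answer's code or of any package:

```r
# Hypothetical helper: download each URL into dest_dir, skipping files that
# already exist and recording failures instead of stopping the loop.
download_all <- function(urls, dest_dir = ".", downloader = download.file, pause = 1) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  status <- character(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, basename(urls[i]))
    if (file.exists(dest)) {
      status[i] <- "skipped"
      next
    }
    # download.file() returns 0 on success; treat errors and other codes as failure
    ok <- tryCatch(
      isTRUE(downloader(urls[i], dest, mode = "wb") == 0),
      error = function(e) FALSE
    )
    if (!ok && file.exists(dest)) unlink(dest)  # remove any partial file
    status[i] <- if (ok) "ok" else "failed"
    Sys.sleep(pause)  # pause between requests
  }
  data.frame(url = urls, status = status, stringsAsFactors = FALSE)
}
```

With the `urls` vector from the code above, a call such as `download_all(urls, "rbi_pdfs")` would save into a folder named `rbi_pdfs` (an arbitrary name) and return a small table showing which downloads succeeded.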

edited May 07 '23 at 22:56

InSync

answered Oct 27 '21 at 07:29

Otto Kässi

Re: file names -- you can change `filenames` to e.g. `paste0('pdf',seq(1, length(urls),1),'.pdf') -> filenames` – Otto Kässi Nov 01 '21 at 09:15
Hi. Thanks a lot :) – Manu Nov 08 '21 at 07:26
Hi. I have one more query. The right-hand column lists data for various years, e.g. 2020, 2019, etc. Can the code be modified so that it also downloads the data for those years? – Manu Dec 15 '21 at 04:56
@Manu - I believe you could use `elem <- remDr$findElement(using = "link text", "2020")` and `elem$clickElement()`, but I haven't tested this solution. I would recommend asking a new question. – Otto Kässi Dec 15 '21 at 07:32
Hi. Thank you. The web page for 2020 opens (i.e. it is accessed for 2020), but somehow the files are not downloaded to my system. Yes, I will ask this as a new question. Thanks a lot – Manu Dec 20 '21 at 05:13

Bloxx · Answer 2 · 2021-10-27T07:30:55.183

There were a few small mistakes. The website uses capital letters for the PDF extensions, and you don't need str_c("https://rbi.org.in", .). Finally, I think using purrr's walk2 function is smoother (as it probably was in the original code).

I haven't executed the code, because I don't need that many PDFs, so please report whether it works.

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)

page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")

raw_list <- page %>%        # takes the page above for which we've read the html
  html_nodes("a") %>%       # find all links in the page
  html_attr("href") %>%     # get the url for these links
  str_subset("\\.PDF") %>%  # keep only the links that contain .PDF
  walk2(., basename(.), download.file, mode = "wb")  # save each url under its own file name
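
Matching on `\\.PDF` works here because this site uses upper-case extensions. A small sketch of a case-insensitive filter, written in base R (a swap for `str_subset`) so that it runs without extra packages. The `hrefs` vector is made-up sample data standing in for the output of `html_attr("href")`, not links taken from the RBI page:

```r
# Made-up sample hrefs; NA stands for an <a> tag without an href attribute.
hrefs <- c(
  "https://example.org/docs/TABLE1.PDF",
  "https://example.org/docs/table2.pdf",
  "https://example.org/about.aspx",
  NA
)

# Keep only links that end in .pdf, ignoring case, and drop missing hrefs.
is_pdf <- !is.na(hrefs) & grepl("\\.pdf$", hrefs, ignore.case = TRUE)
pdf_urls <- hrefs[is_pdf]
print(pdf_urls)
```

In the stringr pipeline, the same idea should be expressible as `str_subset(regex("\\.pdf$", ignore_case = TRUE))`.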

Hi. Thanks. The code worked perfectly. I have just one more question: how do I change the code if I want the PDFs to be named pdf1, pdf2 and so on instead of the default names? – Manu Nov 01 '21 at 09:02
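
For numbered names, one option is to build the destination names from each URL's position rather than from `basename()`. A minimal sketch, assuming `urls` is the character vector of PDF links collected above; the values below are placeholders, not real RBI links:

```r
# Placeholder URLs standing in for the scraped PDF links.
urls <- c(
  "https://example.org/docs/A.PDF",
  "https://example.org/docs/B.PDF",
  "https://example.org/docs/C.PDF"
)

# One sequential name per URL: pdf1.pdf, pdf2.pdf, ...
filenames <- sprintf("pdf%d.pdf", seq_along(urls))
print(filenames)

# The downloads would then pair each URL with its new name, e.g.:
# purrr::walk2(urls, filenames, download.file, mode = "wb")
```

Inside the pipeline, this would mean replacing `basename(.)` with `sprintf("pdf%d.pdf", seq_along(.))` in the `walk2` call. The original file names are lost this way, so it may be worth keeping the `urls` vector around to map numbers back to tables.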
