
I'm trying to webscrape some information from the following website:

https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=0001-D-2014.

I want to iterate over the bill numbers with the code below. I've run this code on previous years and it has worked well, but for this year the connection keeps breaking. The code is listed below:

library(rvest)

summary2 <- data.frame(matrix(nrow=2, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- sprintf('%0.4d', 1:10048)  # bill numbers zero-padded to four digits, e.g. "0001"


for (i in k) {
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  
  
  table <- html_node(link, "table.table.table-bordered tbody")
  
  table_text <- html_text(table)
  
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)
  table_text <- gsub("", "", table_text)
  
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

The errors I am getting are the following:

Error in open.connection(x, "rb") : HTTP error 500.
In addition: Warning message:
In for (i in seq_along(cenv$extra)) { :
  closing unused connection 3 (https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=0279-D-2014)

The code stops at certain bill links even though those links work when I paste them into a browser, and I'm not sure why it is breaking.

I tried breaking up the loop to skip the bill links that were not working, but this is not an ideal solution because a) it skips links that fail in the code but actually contain data I want to collect, and b) it is very inefficient.

  • A possibility is to try again a few times if it returns an error, adding a `Sys.sleep(2)` in between. The service might not like to be reloaded as often as your code can do it. – harre Aug 30 '22 at 13:48
  • However, I haven't been able to see any content here: `https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=` (tried a few times during the last hour). – harre Aug 30 '22 at 13:49
  • I think I'm seeing the issue. Even if I'm able to get the first link to work, I think it is stopping at the second link because these bills/hyperlinks do not exist (https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=387); as you can see, there is no 279. Is there any way to deal with this in the loop, or will I just have to break the loop up into smaller sections around these missing bills? – Kaitlin Aug 30 '22 at 14:17
  • You could use `tryCatch`, so that it doesn't break. If it returns e.g. 500 `type_text <- NA`, `table_text <- NA`. – harre Aug 30 '22 at 15:43
  • You could see how here e.g. https://stackoverflow.com/questions/39056103/iterating-rvest-scrape-function-gives-error-in-open-connectionx-rb-time – harre Aug 30 '22 at 15:50
  • 1
  • Thanks for your help! I think this is something that would work. I'm just unsure how to implement it with my code, since I am iterating over links in the loop instead of starting with a list of links as that post does. I'm very new to web scraping in R, so I'm not exactly sure how to adapt the tryCatch function to my code. – Kaitlin Aug 30 '22 at 16:18

1 Answer


You could catch the error using `tryCatch()` and add NAs to your table in those cases:

library(rvest)

summary2 <- data.frame(matrix(nrow=0, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- c("0278", "0279", "0280")

for (i in k) {

  ## First scrape ##

  # Sys.sleep(1) # Uncomment if necessary

  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  ## Second scrape ##

  # Sys.sleep(1) # Uncomment if necessary

  link <- tryCatch(read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
                   error = function(e) NA)
  
  if (length(link) == 1 && is.na(link)) {  # read_html() failed, so fill with NAs
    
    type_text <- NA
    table_text <- NA
    
  } else {
  
    type <- html_nodes(link, 'h3')
    type_text <- html_text(type)
    table <- html_node(link, "table.table.table-bordered tbody")
  
    table_text <- html_text(table)
  
    table_text <- gsub("\n", "", table_text)
    table_text <- gsub("\t", "", table_text)
    table_text <- gsub("", "", table_text)
    
  }
  
  ## Output ##

  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

Output:

tibble::as_tibble(summary2)
# A tibble: 3 × 4
  billnum     sum                                                                                                           type  name_…¹
  <chr>       <chr>                                                                                                         <chr> <chr>  
1 0278-D-2014 "0278-D-2014  ProyectoSu beneplácito por el reconocimiento que la revista científica Nature realizara a un g… " PR… "ASSEF…
2 0279-D-2014 "0279-D-2014  ProyectoSu Benplacito al conmemorarase  el  natalicio de el Dr.  Joaquin V.  Gonzalezel 6 de m…  NA    NA    
3 0280-D-2014 "0280-D-2014  ProyectoLA HONORABLE CAMARA DE DIPUTADOS EXPRESA SU ADHESIÓN AL CONMEMORARSE EL 07 DE MARZO \"… " PR… "GRANA…
# … with abbreviated variable name ¹​name_dis_part
  • Thanks so much for your response! I think this is working well, but I'm still getting some timeout errors: `Error in open.connection(x, "rb") : Timeout was reached: [www.hcdn.gob.ar] Connection timed out after 10006 milliseconds` In addition: There were 39 warnings (use warnings() to see them). Is there maybe a place I should add in the `Sys.sleep(x)` function? – Kaitlin Aug 30 '22 at 19:45
  • Yes, you could try to insert `Sys.sleep(1)` where the for loop starts and between the two links to give it a bit of a rest (see the code comments in my post - just updated). You might also want to wrap the first link in a `tryCatch` as well (see the sketch below these comments) - and check out how to close connections: https://stackoverflow.com/questions/37839566/how-do-i-close-unused-connections-after-read-html-in-r – harre Aug 31 '22 at 10:02
  • You might also want to check this. The error could be due to you being behind a proxy: https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached – harre Aug 31 '22 at 10:08
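
Following up on the retry and Sys.sleep suggestions in the comments above, here is a minimal, untested sketch of that idea. The helper name `read_html_retry` and its `tries`/`pause` arguments are made up for illustration; it simply retries a fetch a few times, pausing between attempts, and returns NA once every attempt has failed:

library(rvest)

# Hypothetical helper (not from the posts above): fetch a page, retrying a few
# times with a pause between attempts, and return NA if every attempt fails.
read_html_retry <- function(url, tries = 3, pause = 2) {
  for (attempt in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) NA)
    if (inherits(page, "xml_document")) return(page)  # success
    Sys.sleep(pause)                                  # rest before the next attempt
  }
  NA  # every attempt failed
}

# Usage inside the loop, in place of the plain read_html() calls:
# webpage <- read_html_retry(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
# link    <- read_html_retry(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014"))

Swapping the two read_html() calls for this wrapper keeps the tryCatch/NA logic from the answer intact while giving the server a short rest between failed attempts.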