I'm trying to webscrape some information from the following website:
https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=0001-D-2014.
I want to iterate over the bill numbers with the code below. I've run this same code on previous years and it worked well, but for this year the connection keeps breaking. Here is the code:
library(rvest)

summary2 <- data.frame(matrix(nrow = 2, ncol = 4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")

# Zero-padded bill numbers: "0001" through "10048"
k <- sprintf('%04d', 1:10048)

for (i in k) {
  # Full-text page: bill number and summary
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub(" ", "", billsum_text)

  # Summary page: bill type and the sponsor table
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  table <- html_node(link, "table.table.table-bordered tbody")
  table_text <- html_text(table)
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)

  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}
The errors I am getting are the following:
Error in open.connection(x, "rb") : HTTP error 500.
In addition: Warning message:
In for (i in seq_along(cenv$extra)) { :
closing unused connection 3 (https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=0279-D-2014)
The loop stops at certain bill links even though those same links work fine when I open them in a browser, so I'm not sure why it breaks.

I tried breaking up the loop to skip the bill links that were failing, but that is not an ideal solution because (a) it drops bills that fail in the script yet actually have data I want to collect, and (b) it seems very inefficient.
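For reference, the skipping I tried was roughly along these lines: wrapping each `read_html` call in `tryCatch` with a few retries, so a failed request returns NULL instead of killing the whole loop (a sketch, not my exact code; the `reader` argument is just a hook so the function defaults to `rvest::read_html`):

```r
# Try a URL up to `attempts` times, pausing between tries.
# Returns NULL if every attempt fails, instead of stopping the
# loop with an error the way a bare read_html() call does.
safe_read <- function(url, attempts = 3, pause = 2, reader = rvest::read_html) {
  for (a in seq_len(attempts)) {
    result <- tryCatch(reader(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(pause)  # brief wait before retrying
  }
  NULL  # all attempts failed -- caller can skip this bill
}
```

Inside the loop I then checked `is.null(webpage)` and skipped to the next bill number, which is exactly the data loss described above that I'd like to avoid.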