I have an R script that uses rvest to pull some data from AccuWeather. The AccuWeather URLs contain IDs that uniquely correspond to cities. I'm trying to pull the IDs in a given range along with the associated city names. rvest works perfectly for a single ID, but when I iterate over the range in a for loop it eventually returns this error: "Error in open.connection(x, "rb") : HTTP error 502."
I suspect this error means the website is blocking me. How do I get around this? I want to scrape quite a large range (10,000 IDs), and the error appears after roughly 500 iterations of the loop. I also tried closeAllConnections() and Sys.sleep(), but to no avail. I'd really appreciate any help with this problem.
EDIT: Solved. I found a way around it through this thread here: Use tryCatch skip to next value of loop upon error?. I used tryCatch() with error = function(e) e as an argument; it suppressed the error message and allowed the loop to continue without breaking. Hopefully this will be helpful to anyone else stuck on a similar problem.
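For anyone who wants the pattern spelled out, here is a minimal sketch of how I applied it, using the same read_html() call as in the script below (the inherits() test is just one way to detect the error object the handler returns):

library(rvest)

for (i in 300000:310000) {
  accu <- tryCatch(
    read_html(paste0("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", i)),
    error = function(e) e
  )
  # on failure (e.g. HTTP error 502) tryCatch() returns the error object
  # instead of stopping the script; test for it and skip to the next ID
  if (inherits(accu, "error")) next
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # ... store the ID and city exactly as in the full script below ...
}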
library(rvest)
library(httr)

# create a character matrix to store IDs and cities
# (each ID corresponds to a single city; a numeric matrix would be
# coerced to character anyway when the first city name is stored)
id_mat <- matrix(NA_character_, ncol = 2, nrow = 10001)

# initialize index for the matrix row
j <- 1

for (i in 300000:310000) {
  z <- as.character(i)
  # pull the city name from the website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values
  id_mat[j, 1] <- z
  id_mat[j, 2] <- citystate
  # advance to the next matrix row (incrementing i by hand has no effect:
  # a for loop resets the loop variable at the top of every iteration)
  j <- j + 1
  # close connections after every 200 pulls, wait 5 mins, then carry on
  if (j %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
  } else {
    # otherwise sleep for 1 or 2 seconds each iteration
    Sys.sleep(sample(2, 1))
  }
}
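Once the loop finishes, the matrix can be labeled and inspected; a short usage sketch (the column names are just my own choice of labels):

# name the columns and convert to a data frame for easier filtering
colnames(id_mat) <- c("id", "city")
id_df <- as.data.frame(id_mat, stringsAsFactors = FALSE)
head(id_df)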