I have an R script that uses rvest to pull some data from AccuWeather. The AccuWeather URLs contain IDs that uniquely correspond to cities. I'm trying to pull each ID in a given range along with the associated city name. rvest works perfectly for a single ID, but when I iterate over the IDs in a for loop it eventually returns this error: "Error in open.connection(x, "rb") : HTTP error 502."

I suspect this error is due to the website blocking me. How do I get around it? I want to scrape quite a large range (10,000 IDs), and the error appears after roughly 500 iterations of the loop. I also tried closeAllConnections() and Sys.sleep(), but to no avail. I'd really appreciate any help with this problem.

EDIT: Solved. I found a way around it through this thread: Use tryCatch skip to next value of loop upon error?. I used tryCatch() with error = function(e) e as an argument, which suppressed the error message and allowed the loop to continue without breaking (a minimal sketch of that pattern follows the script below). Hopefully this will be helpful to anyone else stuck on a similar problem.

library(rvest)
library(httr)

# create a matrix to store IDs and cities
# each ID corresponds to a single city
id_mat <- matrix(0, ncol = 2, nrow = 10001)

# initialize index for matrix row
j <- 1

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values
  id_mat[j, 1] <- i
  id_mat[j, 2] <- citystate
  # increment by 1
  i <- i + 1
  j <- j + 1
  # close connection after 200 pulls, wait 5 mins and loop again
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
    next
  } else {
    # sleep for 1 or 2 seconds every loop
    Sys.sleep(sample(2, 1))
  }
}
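
A minimal sketch of that tryCatch() pattern applied to the loop above (simplified: the sleep logic is omitted, and it assumes rvest is loaded and id_mat is initialized as in the script):

j <- 1
for (i in 300000:310000) {
  z <- as.character(i)
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  # on error, tryCatch() returns the condition object instead of stopping the loop
  accu <- tryCatch(read_html(url), error = function(e) e)
  if (inherits(accu, "error")) next   # skip this ID and move on (e.g. on HTTP error 502)
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  id_mat[j, 1] <- i
  id_mat[j, 2] <- citystate
  j <- j + 1
}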
user5873424

1 Answer

The problem seems to be coming from scientific notation: as.character() renders round numbers such as 300000 as "3e+05", which corrupts the URL. See:

How to disable scientific notation?
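
You can see the effect in the console (a quick check of my own; the scipen behavior shown is that of the R versions current when this was written):

as.character(300000)   # "3e+05" -> this is what ends up pasted into the URL
options(scipen = 999)  # penalize scientific notation
as.character(300000)   # "300000"

An alternative that avoids the global option is format(i, scientific = FALSE).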

I changed your code slightly, and now it seems to be working:

library(rvest)
library(httr)

id_mat <- matrix(0, ncol = 2, nrow = 10001)

# try to download the page; return 1 on success, 0 on any error or warning
readUrl <- function(url) {
  out <- tryCatch(
    {
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      return(1)
    },
    error = function(cond) {
      return(0)
    },
    warning = function(cond) {
      return(0)
    }
  )
  return(out)
}

j <- 1

# disable scientific notation so as.character(i) keeps all digits
options(scipen = 999)

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  if (readUrl(url) == 1) {
    # readUrl() has already saved the page, so parse the local copy
    accu <- read_html("scrapedpage.html")
    citystate <- accu %>% html_nodes('h1') %>% html_text()
    # store values
    id_mat[j, 1] <- i
    id_mat[j, 2] <- citystate
    j <- j + 1
    # pause for 5 minutes after every 200 pulls
    if (i %% 200 == 0) {
      closeAllConnections()
      Sys.sleep(300)
    } else {
      # sleep for 1 or 2 seconds every loop
      Sys.sleep(sample(2, 1))
    }
  }
}
  • Okay, so I tried your code and it worked for about 1,300 IDs before I got a similar error message: "In download.file(url, destfile = "scrapedpage.html", quiet = TRUE) : cannot open URL 'https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/349106': HTTP status was '502 Bad Gateway'" – user5873424 Jul 05 '19 at 18:02
  • To avoid errors like this, you can refer to this answer; I edited my code above to include the readUrl function from here: https://stackoverflow.com/a/12195574/10710995 – Erdem Emin Akcay Jul 08 '19 at 06:41
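
As the first comment shows, skipping failed IDs does not stop the server from returning 502s. If the cause is rate limiting, a further refinement (a sketch of my own, not from the answer above) is to retry each failed download a few times with an increasing pause before giving up:

# retry a download up to `tries` times, doubling the pause after each failure
# (sketch only; the pause length and number of tries are guesses to tune)
readUrlRetry <- function(url, tries = 3, pause = 60) {
  for (attempt in 1:tries) {
    ok <- tryCatch(
      {
        download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
        TRUE
      },
      error = function(cond) FALSE,
      warning = function(cond) FALSE
    )
    if (ok) return(1)
    # back off before the next attempt: 60 s, 120 s, 240 s, ...
    Sys.sleep(pause * 2^(attempt - 1))
  }
  return(0)
}

This drops in as a replacement for readUrl() in the loop above.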