
Thanks to StackOverflow, I have been able to use the following code to download a series of photos from a public website.

urls <- c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
"https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

for (url in 1:length(urls)) {

  print(url)
  webpage <- html_session(urls[url])
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for(j in 1:length(img.url)){

    download.file(img.url[j], paste0(url,'.',j,".jpg"), mode = "wb")
  }

}

However, some of the links contain no photos, so `download.file()` fails with an HTTP status error and the whole download process stops.

So, I want to insert an `if` statement that tells R to ignore/bypass the pages that contain no photos or that return a '404 Not Found' error. The problem is that I do not know which function or condition would identify a page with no image or a '404 Not Found' error. Any suggestions would be appreciated.

Jim O.

2 Answers

library(purrr)
library(rvest)
library(httr)

urls <- c(
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
  "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

sGET <- safely(GET)                                           # make a "safe" version of httr::GET

map(urls, read_html) %>%                                      # read each page
  map(html_nodes, "img") %>%                                  # extract img tags
  flatten() %>%                                               # convert to a simple list
  map_chr(html_attr, "src") %>%                               # extract the URL
  walk(~{                                                     # for each URL
    res <- sGET(.x)                                           # try to retrieve it
    if (!is.null(res$result)) {                               # if there were no fatal errors
      if (status_code(res$result) == 200) {                   # and, if found
        writeBin(content(res$result, as="raw"), basename(.x)) # save it to disk
      }
    }
  })

is an alternative, functional, "safe" way.
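
For readers unfamiliar with `purrr::safely()`: it wraps a function so that every call returns a list with two elements, `result` and `error`, exactly one of which is `NULL`. That is what the `!is.null(res$result)` and `status_code(res$result) == 200` checks above rely on. A minimal illustration (the two URLs are just placeholders):

library(purrr)
library(httr)

sGET <- safely(GET)

ok  <- sGET("https://httpbin.org/status/404")  # placeholder: request completes, server answers 404
bad <- sGET("http://no-such-host.invalid/")    # placeholder: the request itself fails (DNS error)

ok$error                # NULL -> GET() did not throw; the 404 response sits in ok$result
status_code(ok$result)  # 404, which is why the pipeline also checks for status 200
bad$result              # NULL -> GET() threw; the captured error is in bad$error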

hrbrmstr
  • Thank you for sharing! It didn't work for the "bigger set." But, I will certainly have this as a reference : ) – Jim O. Nov 07 '17 at 18:16
  • What exactly did not work for the "bigger" set? This is a pretty robust solution so I'm curious as to what the specific error condition was. – hrbrmstr Nov 07 '17 at 19:04
  • Yes, of course, this is a very good solution, and I appreciate your time and effort. However, it didn't work for the list I am working with, which contains about 10,000 URLs. I'm rerunning your script, but it's taking awhile, so will post the error once it's finished running. – Jim O. Nov 08 '17 at 14:13
  • When downloading that many images, you'd be better off with another idiom: one that tests for existence and then uses `curl::curl_fetch_multi()`. – hrbrmstr Nov 08 '17 at 14:27
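
For reference, a minimal sketch of the `curl::curl_fetch_multi()` idiom mentioned in the comment above, assuming the image URLs have already been scraped into a character vector (`img_urls` below is a hypothetical placeholder): every request is queued into a shared pool, `multi_run()` performs them all concurrently, and only responses with status 200 are written to disk.

library(curl)

img_urls <- c("https://example.com/img/one.jpg",  # hypothetical stand-ins for the
              "https://example.com/img/two.jpg")  # full list of scraped image URLs

pool <- new_pool()                                # shared connection pool

for (u in img_urls) {
  local({
    u_local <- u                                  # capture this URL for the callbacks
    curl_fetch_multi(
      u_local,
      done = function(res) {                      # runs when a response comes back
        if (res$status_code == 200) {             # keep only successful downloads
          writeBin(res$content, basename(u_local))
        }
      },
      fail = function(msg) {                      # runs on connection-level failures
        message("Failed: ", u_local, " (", msg, ")")
      },
      pool = pool
    )
  })
}

multi_run(pool = pool)                            # perform all queued requests concurrently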

Just use the `try()` function:

urls <- c("https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0090/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0089/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0088/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0087/13", 
          "https://ec.europa.eu/consumers/consumers_safety/safety_products/rapex/alerts/?event=viewProduct&reference=A12/0086/13"
)

for (url in 1:length(urls)) {

  print(url)
  webpage <- html_session(urls[url])
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for(j in 1:length(img.url)){

    try(download.file(img.url[j], paste0(url, '.', j, ".jpg"), mode = "wb"),
        silent = TRUE)   # a failed download (e.g. a 404) no longer stops the loop
  }
}

In addition, you can add an `if` condition to report which downloads succeeded and which failed:

for (url in 1:length(urls)) {

  print(url)
  webpage <- html_session(urls[url])
  link.titles <- webpage %>% html_nodes("img")
  img.url <- link.titles %>% html_attr("src")

  for(j in 1:length(img.url)){

    try_download <- try(
      download.file(img.url[j], paste0(url,'.',j,".jpg"), mode = "wb")
      ,silent = TRUE)

    if(is(try_download, "try-error")){            # the download failed (e.g. a 404)
      print(paste0("ERROR: ", img.url[j]))
    } else {
      print(paste0("Downloaded: ", img.url[j]))
    }
  }
}
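
One more guard worth noting, since the question is specifically about pages with no photos: if a page has no `<img>` tags at all, `img.url` has length zero and `1:length(img.url)` evaluates to `c(1, 0)`, so the inner loop still runs twice with invalid indices (the `try()` catches the resulting errors, but you get spurious ERROR messages). A sketch of just the inner loop, dropped into the same outer loop as above, using base R's `seq_along()` and `inherits()` instead:

  for (j in seq_along(img.url)) {                 # runs zero times when the page has no images

    try_download <- try(
      download.file(img.url[j], paste0(url, '.', j, ".jpg"), mode = "wb"),
      silent = TRUE)

    if (inherits(try_download, "try-error")) {    # base-R equivalent of is(x, "try-error")
      print(paste0("ERROR: ", img.url[j]))
    } else {
      print(paste0("Downloaded: ", img.url[j]))
    }
  }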
rhyte