Error in open.connection(x, "rb") : HTTP error 500 when using map_df

Question

I get the error below when trying to scrape a news website. I checked, and page 32 of the site is broken. I would like to skip the error and keep scraping the rest of the URLs.

I have tried the function tryCatch to avoid the broken link, but since I am quite new to R I do not know how to write the code properly. Should I wrap read_html with that function? If so, how?

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

map_df(0:573, function(i) {

  pagina <- read_html(sprintf(url_silla, i, '%s', '%s', '%s', '%s'))
  print(i)

  data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
             date = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
            stringsAsFactors=FALSE)
}) -> noticias_silla

Here is the error. Thanks a lot for any help!

[1] 31
Error in open.connection(x, "rb") : HTTP error 500.
Called from: open.connection(x, "rb")

Does this previous post help? https://stackoverflow.com/q/12193779/5325862 — camille, Aug 08 '19 at 16:26

score 1 · Answer 1 · answered Aug 08 '19 at 17:01

You can build a tryCatch into a function, then pass that function to map_dfr. Set it to return NULL in the event of an error, which won't break the creation of the data frame by map_dfr.

I'd recommend first trying it with map instead, so you can investigate how some indices return the data frame you want, and some return NULL. In either event, the finally argument will print the index.

library(dplyr)
library(purrr)
library(rvest)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

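# Read one results page; return NULL instead of failing if anything errors.
read_page <- function(i) {
  tryCatch(
    {
      # url_silla has a single %d placeholder, so i is the only argument needed.
      pagina <- read_html(sprintf(url_silla, i))
      data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
                 date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                 category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                 tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                 link = paste0("https://www.lasillavacia.com", trimws(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                 stringsAsFactors=FALSE)
    },
    # On any error (e.g. HTTP 500), return NULL so map_dfr skips this page.
    error = function(cond) return(NULL),
    # finally runs whether or not an error occurred.
    finally = print(i)
  )
}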

noticias <- map_dfr(30:33, read_page)
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
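
The difference between map and map_dfr described above can be checked without touching the site. The sketch below is illustrative only: fake_read_page is a hypothetical stand-in for read_page that is assumed to fail on page 32 the way the real request does, and the page column is made up for the demonstration.

```r
library(purrr)  # map_dfr() also needs dplyr to be installed

# Hypothetical stand-in for read_page(): assumed to fail on page 32.
fake_read_page <- function(i) {
  tryCatch(
    {
      if (i == 32) stop("HTTP error 500.")
      data.frame(page = i, titles = paste("title", i), stringsAsFactors = FALSE)
    },
    error = function(cond) NULL
  )
}

# map() keeps one element per index, so the failed page shows up as NULL.
res_list <- map(30:33, fake_read_page)
failed <- (30:33)[map_lgl(res_list, is.null)]

# map_dfr() row-binds the results and drops the NULL element.
res_df <- map_dfr(30:33, fake_read_page)
```

Here failed holds the index of the page that returned NULL, and res_df contains rows only for the pages that succeeded.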

score 0 · Answer 2 · answered Aug 08 '19 at 16:43

The code below only processes page numbers 31, 32 and 33.

I am not going to use map_* to solve the problem, since I believe it might make things more difficult than they need to be. I am going to use a standard for loop instead, as there is no reason not to.

library(rvest)
library(stringr)
library(tidyverse)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

pages <- 31:33
noticias_silla <- vector("list", length = length(pages))

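for(i in pages){
  # url_silla has a single %d placeholder, so i is the only argument needed.
  p <- sprintf(url_silla, i)
  # Capture the condition object instead of stopping on an HTTP error.
  pagina <- tryCatch(read_html(p),
                     error = function(e) e)
  print(i)
  if(inherits(pagina, "error")){
    # Record which page failed so it can be inspected or retried later.
    noticias_silla[[i - pages[1] + 1]] <- list(page_num = i, page = p)
  }else{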
    noticias_silla[[i - pages[1] + 1]] <- data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
                                           date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                                           category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                                           tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                                           link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                                           stringsAsFactors=FALSE)
  }
}

lapply(noticias_silla, class)

#[[1]]
#[1] "data.frame"
#
#[[2]]
#[1] "list"
#
#[[3]]
#[1] "data.frame"

Note that the second list member is a "list", not a "data.frame". This is the one where the error occurred.

noticias_silla[[2]]
#$page_num
#[1] 32
#
#$page
#[1] "https://lasillavacia.com/buscar/farc?page=32"
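
Because the result list mixes data frames with failure records, the two kinds of element have to be separated before the data can be combined. A minimal base R sketch follows, using a small made-up list shaped like noticias_silla; the titles are placeholders, not scraped data.

```r
# Hypothetical result list shaped like noticias_silla:
# two data frames and one failure record.
noticias_silla <- list(
  data.frame(titles = "a", stringsAsFactors = FALSE),
  list(page_num = 32, page = "https://lasillavacia.com/buscar/farc?page=32"),
  data.frame(titles = "b", stringsAsFactors = FALSE)
)

# TRUE for pages that were scraped, FALSE for pages that failed.
is_df <- vapply(noticias_silla, is.data.frame, logical(1))

# Successful pages, stacked into one data frame.
noticias_ok <- do.call(rbind, noticias_silla[is_df])

# Page numbers that need another attempt.
failed_pages <- vapply(noticias_silla[!is_df], function(x) x$page_num, numeric(1))
```

Keeping the failure records makes it possible to retry only those pages later rather than rerunning the whole loop.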

dave-edison · Accepted Answer · 2019-08-08T17:14:40.983

You can use purrr::possibly:

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

library(tidyverse)
library(rvest)

map_df(0:573, possibly(~{

    pagina <- read_html(sprintf(url_silla, .x))

    print(.x)

    data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
               date = html_text(html_nodes(pagina, ".date.col-sm-3")),
               category = html_text(html_nodes(pagina, ".category.col-sm-9")),
               tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
               link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
               stringsAsFactors=FALSE)

}, NULL)) -> noticias_silla
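
To see what possibly does in isolation, here is a minimal sketch that does not touch the network. scrape_page is a hypothetical function standing in for the read_html and parsing step, and it is assumed to fail on page 32.

```r
library(purrr)

# Hypothetical scraper, assumed to fail on page 32
# (stands in for read_html() plus the parsing code).
scrape_page <- function(i) {
  if (i == 32) stop("HTTP error 500.")
  data.frame(page = i, titles = paste("title", i), stringsAsFactors = FALSE)
}

# possibly() returns a wrapped function that yields `otherwise`
# instead of raising an error.
safe_scrape <- possibly(scrape_page, otherwise = NULL)

pages <- 30:33
results <- map(pages, safe_scrape)

# Pages that returned NULL can be logged and retried later.
failed_pages <- pages[map_lgl(results, is.null)]

# Drop the NULLs and row-bind the pages that succeeded.
noticias <- dplyr::bind_rows(compact(results))
```

Collecting the results with map first, rather than binding straight away, is one way to keep track of which pages were skipped.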
