2

I get an error (shown below) when trying to scrape a news website. I checked, and page 32 of the site is broken. I would like to skip the error and keep scraping the rest of the URLs.

I have tried the tryCatch function to skip the broken link, but since I am quite new to R I do not know how to write the code properly. Should I wrap read_html in that function? If so, how?

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

map_df(0:573, function(i) {

  pagina <- read_html(sprintf(url_silla, i))
  print(i)

  data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
             date = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
            stringsAsFactors=FALSE)
}) -> noticias_silla

Here is the error. Thanks a lot for any help!

[1] 31
Error in open.connection(x, "rb") : HTTP error 500.
Called from: open.connection(x, "rb")
Jose David

3 Answers

1

You can build a tryCatch into a function, then pass that function to map_dfr. Set it to return NULL in the event of an error, which won't break the creation of the data frame by map_dfr.

I'd recommend first trying it with map instead, so you can investigate how some indices return the data frame you want, and some return NULL. In either event, the finally argument will print the index.

library(dplyr)
library(purrr)
library(rvest)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

read_page <- function(i) {
  tryCatch(
    {
      pagina <- read_html(sprintf(url_silla, i))
      data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
                 date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                 category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                 tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                 link = paste0("https://www.lasillavacia.com", trimws(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                 stringsAsFactors=FALSE)
    },
    error = function(cond) return(NULL),
    finally = print(i)
  )
}

noticias <- map_dfr(30:33, read_page)
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
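
To illustrate the suggestion of trying map first, here is a minimal offline sketch. It assumes a hypothetical stand-in, fetch_page, that fails on page 32 instead of calling the real site; fetch_page, read_page_safe, resultados and noticias_demo are illustrative names, not part of the original code.

```r
library(purrr)

# Hypothetical stand-in for the scraping step: it fails on page 32,
# mimicking the HTTP 500 error from the question (no network access).
fetch_page <- function(i) {
  if (i == 32) stop("HTTP error 500.")
  data.frame(page = i, titles = paste("title", i), stringsAsFactors = FALSE)
}

# Same pattern as read_page: return NULL when an error occurs.
read_page_safe <- function(i) {
  tryCatch(fetch_page(i), error = function(cond) NULL)
}

pages <- 30:33
resultados <- map(pages, read_page_safe)

# Indices whose result is NULL are the pages that failed.
failed <- pages[map_lgl(resultados, is.null)]

# Drop the NULL entries and bind the remaining data frames.
noticias_demo <- do.call(rbind, compact(resultados))
```

Inspecting resultados shows a data frame for each good page and NULL for the broken one, which is why binding the rows afterwards simply skips it.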
camille
0

The code below only processes page numbers 31, 32 and 33.

I am not going to use map_* to solve the problem, since I believe it might make things more difficult than they are. A standard for loop works, and there is no reason not to use one.

library(rvest)
library(stringr)
library(tidyverse)

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

pages <- 31:33
noticias_silla <- vector("list", length = length(pages))

for(i in pages){
  p <- sprintf(url_silla, i)
  pagina <- tryCatch(read_html(p),
                     error = function(e) e)
  print(i)
  if(inherits(pagina, "error")){
    noticias_silla[[i - pages[1] + 1]] <- list(page_num = i, page = p)
  }else{
    noticias_silla[[i - pages[1] + 1]] <- data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
                                           date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                                           category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                                           tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                                           link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                                           stringsAsFactors=FALSE)
  }
}

lapply(noticias_silla, class)
#[[1]]
#[1] "data.frame"
#
#[[2]]
#[1] "list"
#
#[[3]]
#[1] "data.frame"

Note that the second list member is a "list", not a "data.frame". This is the one where the error occurred.

noticias_silla[[2]]
#$page_num
#[1] 32
#
#$page
#[1] "https://lasillavacia.com/buscar/farc?page=32"
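
Once the loop has finished, the successful pages still need to be separated from the failed ones. One possible way is sketched below in base R, using a small hand-built list with the same shape as noticias_silla; the list contents and the names resultado, ok, noticias and failed_pages are assumptions made for illustration.

```r
# Hand-built stand-in with the same structure the loop produces:
# data frames for good pages, a plain list for the page that failed.
resultado <- list(
  data.frame(titles = "a", stringsAsFactors = FALSE),
  list(page_num = 32, page = "https://lasillavacia.com/buscar/farc?page=32"),
  data.frame(titles = "b", stringsAsFactors = FALSE)
)

# TRUE for the elements that were scraped successfully.
ok <- vapply(resultado, is.data.frame, logical(1))

# Combine the successful pages into a single data frame.
noticias <- do.call(rbind, resultado[ok])

# Collect the page numbers that failed, e.g. to retry them later.
failed_pages <- vapply(resultado[!ok], function(x) x$page_num, numeric(1))
```

Keeping the failed page numbers around makes it easy to rerun the loop over just those pages if the site recovers.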
Rui Barradas
0

You can use purrr::possibly:

url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'

library(tidyverse)
library(rvest)

map_df(0:573, possibly(~{

    pagina <- read_html(sprintf(url_silla, .x))

    print(.x)

    data.frame(titles = html_text(html_nodes(pagina,".col-sm-12 h3")),
               date = html_text(html_nodes(pagina, ".date.col-sm-3")),
               category = html_text(html_nodes(pagina, ".category.col-sm-9")),
               tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
               link = paste0("https://www.lasillavacia.com",str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
               stringsAsFactors=FALSE)

}, NULL)) -> noticias_silla
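
The behaviour of possibly can be checked without hitting the website. The sketch below assumes a hypothetical fetch_page function that fails on page 32; fetch_page, fetch_or_null, resultados and noticias_demo are illustrative names only.

```r
library(purrr)

# Hypothetical stand-in for the scraping step (not the real site):
# it raises an error on page 32, like the HTTP 500 in the question.
fetch_page <- function(i) {
  if (i == 32) stop("HTTP error 500.")
  data.frame(page = i, titles = paste("title", i), stringsAsFactors = FALSE)
}

# possibly() wraps the function so that an error yields `otherwise`
# (here NULL) instead of stopping the whole iteration.
fetch_or_null <- possibly(fetch_page, otherwise = NULL)

resultados <- map(31:33, fetch_or_null)

# The NULL for page 32 is dropped before the rows are bound together.
noticias_demo <- do.call(rbind, compact(resultados))
```

Naming the otherwise argument, rather than passing NULL positionally, makes it clearer what the wrapped function returns on failure.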
dave-edison