I am getting this error when running my code:

Error in data.frame(date = html_text(html_nodes(pagina, ".node-post-date")),  : 
  arguments imply differing number of rows: 9, 10

When scraping page 983, I only get 9 dates (instead of the usual 10 results per page). I think this happens because one of the dates on that page has a different tag from the one I am selecting.

I am quite new to R, so I do not know how to write an if statement that returns NA for the result I am not getting.

Here is my code:

#Libraries
library(rvest)
library(purrr)
library(stringr)  # provides str_trim(), used below
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  pagina <- read_html(sprintf(url_espectador, i))
  print(i)

  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             date = html_text(html_nodes(pagina, ".node-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
  }) -> noticias_espectador

Besides the if statement, is there any other solution to this? I am going to scrape a large number of pages so I need to avoid this row matching problem. Thanks for your help!

marc_s
Jose David
  • You may want to have a look at the `tryCatch` [function](https://stackoverflow.com/questions/12193779/how-to-write-trycatch-in-r) in R. If an error occurs, you can use it to skip to the next step. – maydin Aug 06 '19 at 09:57
  • You were right, @QHarr. I already edited. Thanks! – Jose David Aug 06 '19 at 12:05

1 Answer

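You could use CSS OR syntax (a comma-separated list of selectors) to add the other class. This is suitable when there are only a small number of additional classes, and it is what the code at the end of this answer does.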

Alternatively, you could select a shared parent node, test whether a particular child is present, and return NA if it is not. This answer shows you the latter approach. If you use the latter, a suitable parent node can be selected with .node--search-result. You may miss the actual child of interest (as in this case, where the class is different), but the code won't error out.
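
As a rough sketch of the parent-node idea, assuming a small made-up HTML fragment in place of the real page (the markup, titles and dates below are invented for illustration; only the selectors come from the question and this answer): html_node() applied to a set of parent nodes returns exactly one result per parent, and html_text() gives NA where the child is missing, so every column ends up with the same length.

```r
library(rvest)

# Hypothetical stand-in for one results page: the second result uses a
# different class for its date, so ".node-post-date" only matches once.
pagina <- read_html('<div class="node--search-result">
    <h2 class="node-title"><a href="/noticias/a">First</a></h2>
    <div class="node-post-date">6 Aug 2019</div>
  </div>
  <div class="node--search-result">
    <h2 class="node-title"><a href="/noticias/b">Second</a></h2>
    <div class="field--name-post-date">5 Aug 2019</div>
  </div>')

# One parent node per search result
resultados <- html_nodes(pagina, ".node--search-result")

# html_node() (singular) returns one match per parent, or a missing node
noticias <- data.frame(
  title = html_text(html_node(resultados, ".node-title a")),
  date  = html_text(html_node(resultados, ".node-post-date")),
  link  = paste0("https://www.elespectador.com",
                 trimws(html_attr(html_node(resultados, ".node-title a"), "href"))),
  stringsAsFactors = FALSE
)
```

Here the second row gets NA for its date rather than the whole data.frame() call failing. The date itself is still lost for that row, which is the trade-off mentioned above.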

There is a third option: in the cases observed, the classes have a common suffix, so you could use an attribute = value CSS selector with either the contains (*=) or ends with ($=) operator, e.g. date = html_text(html_nodes(pagina, "[class$='post-date']")).
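
To see how the two operators differ, here is a small sketch on invented markup (the three div elements and their dates are assumptions for illustration, not taken from the real page). Both operators test the whole class attribute string, so $= stops matching once another class follows the suffix, while *= still matches.

```r
library(rvest)

# Hypothetical fragment: two class names ending in "post-date", plus one
# element where an extra class comes after it.
pagina <- read_html('<div class="node-post-date">6 Aug 2019</div>
  <div class="field field--name-post-date">5 Aug 2019</div>
  <div class="node-post-date extra">4 Aug 2019</div>')

# Ends with: the class attribute value must end in "post-date"
fechas_sufijo <- html_text(html_nodes(pagina, "[class$='post-date']"))

# Contains: "post-date" may appear anywhere in the class attribute value
fechas_contiene <- html_text(html_nodes(pagina, "[class*='post-date']"))
```

In this sketch fechas_sufijo holds the first two dates and fechas_contiene holds all three, so the contains form is probably the safer choice if elements can carry additional classes.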

library(rvest)
library(purrr)
library(stringr)  # provides str_trim()
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  # Fill the page number into the %d placeholder
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)

  # The comma in the date selector is CSS "or": match either class
  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             date = html_text(html_nodes(pagina, ".node-post-date, .field--name-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
}) -> noticias_espectador
QHarr