I am scraping webpages with rvest and collecting the results into a data frame with purrr::map_df. The problem is that not every page has content for every selector I pass to html_nodes, and map_df silently drops those incomplete pages. I would like map_df to keep those pages and write NA wherever a selector matches nothing. Take the following code:

library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome", 
             "https://es.wikipedia.org/wiki/Curic%C3%B3")
h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})
out

Here is the output:

> out
# A tibble: 2 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History

The problem is that the output data frame has no row for pages with no match for the #History selector (here, the third URL). My desired output looks like this:

> out
# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA

Any help will be greatly appreciated!

NBK

1 Answer

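You can handle this inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text turns that into character(0). A tibble built with a zero-length column has zero rows, so that page contributes nothing to the result. Check the lengths of a and b and substitute NA when they are zero: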

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()

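  # No match gives character(0); replace it with a character NA so the row survives
  a <- ifelse(length(a) == 0, NA_character_, a)
  b <- ifelse(length(b) == 0, NA_character_, b)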

  df <- tibble(a, b)
})
out

# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA   
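
If you have many selectors, the length check can be pulled into a small helper. The following is only a sketch: text_or_na is a hypothetical name (it is not part of rvest), the two pages are inline HTML stand-ins rather than the Wikipedia URLs so that the example runs offline, and it assumes you only want the first match per selector. map_df also needs dplyr installed, which is the case if you loaded tidyverse.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical helper (not an rvest function): text of the first node
# matching `css`, or NA_character_ when nothing matches.
text_or_na <- function(page, css) {
  txt <- html_text(html_nodes(page, css))
  if (length(txt) == 0) NA_character_ else txt[[1]]
}

# Inline stand-ins for the scraped pages (assumption: made-up HTML,
# not the real Wikipedia markup).
pages <- list(
  read_html("<html><body><h1 id='firstHeading'>Rome</h1><span id='History'>History</span></body></html>"),
  read_html("<html><body><h1 id='firstHeading'>Page without history</h1></body></html>")
)

# Why the row was dropped: no match gives character(0), and a tibble
# with a zero-length column has zero rows.
empty <- html_text(html_nodes(pages[[2]], "#History"))
length(empty)
nrow(tibble(a = "Page without history", b = empty))

# With the helper, every page contributes exactly one row.
out <- map_df(pages, ~ tibble(
  a = text_or_na(.x, "#firstHeading"),
  b = text_or_na(.x, "#History")
))
out
```

Taking only the first match keeps every column at length one, so each page always yields a single row. If a selector can match several nodes and you need all of them, this helper would have to change.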
astrofunkswag
  • Thanks. This was a duplicate indeed, although the phrasing of the question is quite different. I posted a new question that builds on this one here: https://stackoverflow.com/questions/55962196/r-using-rvest-and-purrrmap-df-to-build-a-dataframe-dealing-with-multiple-ele, if you have any ideas. – NBK May 03 '19 at 01:35