I am scraping webpages with rvest and turning the collected data into a dataframe using purrr::map_df. The problem I ran into is that not every webpage has content for every selector I pass to html_nodes, and map_df silently drops such incomplete webpages. I would like map_df to keep those webpages and write NA wherever html_nodes finds no match. Take the following code:
library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome",
             "https://es.wikipedia.org/wiki/Curic%C3%B3")

h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})

out
Here is the output:
> out
# A tibble: 2 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         History
The problem here is that the output dataframe does not contain rows for webpages that have no match for the #History html node (in this case, the third url). My desired output looks like this:
> out
# A tibble: 3 x 2
  a            b
  <chr>        <chr>
1 FC Barcelona History
2 Rome         History
3 Curicó       NA
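For reference, here is a minimal sketch of the shape I am after. It assumes that html_node (singular) followed by html_text gives NA when nothing matches, whereas html_nodes gives a zero-length vector that makes tibble() produce zero rows. The inline pages are hypothetical ASCII stand-ins for the Wikipedia URLs so that it runs without network access, and I have not confirmed this is the recommended approach:

```r
library(rvest)
library(tidyverse)

# Hypothetical inline pages standing in for the real URLs (assumption):
# the first has a #History node, the second does not.
pages <- list(
  '<html><body><h1 id="firstHeading">Rome</h1><span id="History">History</span></body></html>',
  '<html><body><h1 id="firstHeading">Curico</h1></body></html>'
)

h <- pages %>% map(read_html)

# html_node() returns exactly one result per page (a missing node if there
# is no match), so every page contributes exactly one row.
out <- h %>% map_df(~ tibble(
  a = html_node(.x, "#firstHeading") %>% html_text(),
  b = html_node(.x, "#History") %>% html_text()
))

out
```

This would only cover selectors where at most one match per page is wanted; selectors that can match several nodes would need different handling.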
Any help will be greatly appreciated!