A website contains some reviews of books, which I would like to scrape with rvest. It's possible to get the data like this:
library(rvest)
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)
result_list <-
  lapply(1:2, function(i) {
    url <- paste0("http://www.deutschlandradiokultur.de/buchkritik.949.de.html?drbm:page=", i)
    parse_url <-
      url %>%
      xml2::read_html()
    parse_page <-
      list(page = parse_url %>% html_nodes("span.drk-paginationanzahl") %>% html_text(),
           date = parse_url %>% html_nodes(".drk-container") %>% html_nodes(".drk-sendungdatum") %>% html_text(),
           text = parse_url %>% html_nodes(".drk-container") %>% html_nodes(".drk-overline") %>% html_text(),
           stringsAsFactors=FALSE) %>%
      rbind()
  })
The length of "date" sometimes differs from that of "text", so I used a list. Now I am struggling to convert the list into a data frame. Do you have any hints for converting the list? Maybe there is a more elegant way to scrape the page that avoids this problem altogether. The data frame should have the columns "page", "date" and "text". (In a next step I will split the content of "text" into author and title.)
I tried the following approaches:
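One direction that might avoid the length mismatch at the source, sketched under assumptions: select each ".drk-container" first and then use html_node() (singular) inside it, which returns one result per container and NA where the element is missing. The inline HTML below is a made-up stand-in for the real page, whose actual structure may differ, and page_df is just an illustrative name.

```r
library(rvest)

# Hypothetical stand-in for one results page; the second container
# deliberately has no date element.
html <- xml2::read_html('
<div class="drk-container">
  <span class="drk-sendungdatum">01.01.2017</span>
  <span class="drk-overline">author1: "title1"</span>
</div>
<div class="drk-container">
  <span class="drk-overline">author2: "title2"</span>
</div>')

containers <- html_nodes(html, ".drk-container")

# html_node() yields exactly one match (or a missing node) per container,
# so both columns have one entry per container.
page_df <- data.frame(
  date = containers %>% html_node(".drk-sendungdatum") %>% html_text(),
  text = containers %>% html_node(".drk-overline") %>% html_text(),
  stringsAsFactors = FALSE
)
page_df
```

If the real containers nest differently, the selectors would need adjusting, but the one-row-per-container idea should carry over.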
result_df1 <-
  as.data.frame(do.call(rbind, result_list))
result_df2 <-
  as.data.frame(do.call(rbind, lapply(result_list, data.frame, stringsAsFactors=FALSE)))
result_df3 <-
  as.data.frame(Reduce(rbind, lapply(result_list, unlist)))
result_df4 <-
  as.data.frame(lapply(result_list, unlist))
result_df5 <-
  lapply(result_list, tidyr::unnest)
result_df6 <-
  result_list %>% purrr::dmap(unlist)
result_df7 <-
  result_list %>%
  unlist(recursive = FALSE) %>%
  tibble::enframe() %>%
  unnest()
In result_df1 and result_df2, the data frame has a list in each cell. How can I unlist these by column? I think a big problem is that the lengths differ between list elements. How can I handle this?
example1 is similar to my problem, with different lengths in the list. Even with equal lengths (example2) I struggle with the conversion to a data frame.
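To illustrate where the list cells seem to come from, here is a small base R sketch with made-up values (p1, p2 and the other names are only for illustration). Calling rbind() on a list gives a one-row list-matrix, so stacking pages keeps a whole vector in each cell. The last lines show one way to even out unequal lengths by padding with NA.

```r
# Two toy pages in the same spirit as the scraped result (values are made up).
p1 <- rbind(list(page = "page 1/490", date = c("a", "b"), text = c("t1", "t2")))
p2 <- rbind(list(page = "page 2/490", date = "c",         text = c("t3", "t4")))

# Stacking the one-row list-matrices keeps a list-matrix:
# each cell still holds a whole vector.
stacked  <- do.call(rbind, list(p1, p2))
df_cells <- as.data.frame(stacked)
vapply(df_cells, is.list, logical(1))   # checks which columns are list columns

# Unequal lengths can be evened out by padding the shorter vector with NA.
row2 <- p2[1, ]                          # named list for the second page
date <- row2$date
text <- row2$text
length(date) <- length(text)             # extends date with NA at the end
data.frame(date, text, stringsAsFactors = FALSE)
```

Note that padding only appends NA at the end; it cannot tell which entry on the page was actually missing, so dates and texts may end up misaligned.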
example1 <-
  list(structure(list("page 1/490",
                      c("a", "b", "c", "d"),
                      c("author1: \"title1\"", "author2: \"title2\"", "author3: \"title3\"", "author4: \"title4\""),
                      FALSE),
                 .Dim = c(1L, 4L),
                 .Dimnames = list(".", c("page", "date", "text", "stringsAsFactors"))),
       structure(list("page 2/490",
                      c("e", "f", "g"),
                      c("author5: \"title5\"", "author6: \"title6\"", "author7: \"title7\"", "author8: \"title8\""),
                      FALSE),
                 .Dim = c(1L, 4L),
                 .Dimnames = list(".", c("page", "date", "text", "stringsAsFactors")))
  )
example2 <-
  list(structure(list(c("a", "b", "c", "d"),
                      c("author1: \"title1\"", "author2: \"title2\"", "author3: \"title3\"", "author4: \"title4\""),
                      FALSE),
                 .Dim = c(1L, 3L),
                 .Dimnames = list(".", c("date", "text", "stringsAsFactors"))),
       structure(list(c("e", "f", "g", "h"),
                      c("author5: \"title5\"", "author6: \"title6\"", "author7: \"title7\"", "author8: \"title8\""),
                      FALSE),
                 .Dim = c(1L, 3L),
                 .Dimnames = list(".", c("date", "text", "stringsAsFactors")))
  )
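For structures shaped like the examples above, one possible conversion is sketched here in base R. The helpers make_page() and page_to_df() are hypothetical names introduced for this illustration, and the data is a shortened stand-in for example1. The idea is to drop the stray "stringsAsFactors" element, recycle length-one entries such as "page", pad the remaining vectors with NA, and then stack the per-page data frames.

```r
# Hypothetical constructor reproducing the 1 x 4 list-matrix shape of example1.
make_page <- function(page, date, text) {
  structure(list(page, date, text, FALSE),
            .Dim = c(1L, 4L),
            .Dimnames = list(".", c("page", "date", "text", "stringsAsFactors")))
}

example1_like <- list(
  make_page("page 1/490", c("a", "b", "c", "d"), paste0("author", 1:4)),
  make_page("page 2/490", c("e", "f", "g"),      paste0("author", 5:8))
)

# Hypothetical helper: turn one 1 x k list-matrix into a data frame.
page_to_df <- function(m) {
  cols <- m[1, ]                                    # named list of the cells
  cols <- cols[names(cols) != "stringsAsFactors"]   # drop the accidental element
  n <- max(lengths(cols))
  cols <- lapply(cols, function(x) {
    if (length(x) == 1L) return(rep(x, n))          # recycle scalars such as page
    length(x) <- n                                  # pad shorter vectors with NA
    x
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

result_df <- do.call(rbind, lapply(example1_like, page_to_df))
result_df
```

The same helper should also accept the example2 shape, since it only relies on the column names present in each list-matrix. As before, NA padding is appended at the end and does not identify which date was missing.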