
Consider this simple example:

library(rvest)
library(tidyverse)
library(dplyr)
library(lubridate)
library(tibble)

mytib <- tibble(mylink = c('https://en.wikipedia.org/wiki/List_of_software_bugs',
                           'https://en.wikipedia.org/wiki/Software_bug'))


mytib <- mytib %>% mutate(html.data = map(mylink, ~read_html(.x)))

> mytib
# A tibble: 2 x 2
  mylink                                              html.data 
  <chr>                                               <list>    
1 https://en.wikipedia.org/wiki/List_of_software_bugs <xml_dcmn>
2 https://en.wikipedia.org/wiki/Software_bug          <xml_dcmn>

> mytib$html.data[1]
[[1]]
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<title> ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-List_of_software_b ...

As you can see, my tibble correctly contains the HTML code of the two Wikipedia pages whose URLs are stored in the column mylink. The problem is that I am not able to save this hard-won scraping to disk. A simple write_csv fails:

> mytib %>% write_csv('mydata.csv')
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) : 
  Don't know how to handle vector of type list.

while writing to RDS does not work correctly either:

mytib %>% write_rds('mydata.rds')
test <- read_rds('mydata.rds')

> test$html.data[1]
[[1]]
Error in doc_type(x) : external pointer is not valid

What should I do? In which format should I store my data? Thanks!

ℕʘʘḆḽḘ

2 Answers


Do you really need to store the entire HTML in a CSV? The raw HTML isn't useful in itself; you may want to extract only the relevant parts and store those in a column. For example, here we extract the page title:

library(dplyr)
library(rvest)
library(purrr)

mytib %>% 
  # parse each page, keep only the <title> text, then drop the raw html
  mutate(html.data = map(mylink, read_html), 
         title = map_chr(html.data, ~ .x %>% html_nodes('title') %>% html_text())) %>%
  select(-html.data) %>%
  write.csv('data.csv', row.names = FALSE)
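
With the list column dropped, you are left with plain character columns, so the file round-trips through CSV without complaint. A quick check (a minimal sketch, assuming the data.csv written above):

library(readr)
read_csv('data.csv')
# two character columns, mylink and title -- no list column left,
# so both writing and reading back work
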
Ronak Shah

The reason for this has been discussed here: an xml_document is only an external pointer to memory managed by libxml2, so it cannot be serialized directly.
As a workaround, you can convert the xml_document to a string in order to save it:

mytib <- mytib %>% mutate(html.data = map(mylink, ~toString(read_html(.x))))
mytib %>% write_rds('mydata.rds')
test <- read_rds('mydata.rds')
test$html.data[[1]]
[1] "<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"UTF-8\">\n<title>List of software bugs - Wikipedia</title>\n

You can then recreate an xml_document:

test %>% mutate(xmlDoc = map(html.data, ~read_html(.x)))
# A tibble: 2 x 3
  mylink                                              html.data xmlDoc    
  <chr>                                               <list>    <list>    
1 https://en.wikipedia.org/wiki/List_of_software_bugs <chr [1]> <xml_dcmn>
2 https://en.wikipedia.org/wiki/Software_bug          <chr [1]> <xml_dcmn>
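
The recreated xml_document behaves like a freshly parsed one, so the usual rvest verbs work on it again. For instance, a minimal check that pulls the page titles out of the rebuilt documents (assuming rvest and purrr are loaded as above):

test %>%
  mutate(xmlDoc = map(html.data, ~read_html(.x)),
         title = map_chr(xmlDoc, ~.x %>% html_nodes('title') %>% html_text()))
# adds a title column, e.g. "List of software bugs - Wikipedia",
# confirming the round trip preserved the page content
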
Waldi