I tried to cache the result of xml2::read_html to avoid flooding the server during development:

library(digest)
library(xml2)
url = "https://en.wikipedia.org"
cache = digest(url)
if (file.exists(cache)) {
  cat("Reading from cache\n")
  html = readRDS(cache)
} else {
  #Sys.sleep(3)
  cat("Reading from web\n")
  html = xml2::read_html(url) 
  saveRDS(html, file = cache)
}
html

This fails because only external pointers are stored in the file, and they are no longer valid on re-run. The same problem occurs when I use memoise on read_html.
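The failure can be reproduced without any network access. The following is a minimal sketch, assuming a small inline XML document in place of the Wikipedia page; the temporary file and the variable names are only illustrative.

```r
library(xml2)

# Small inline document, used here as a stand-in for the downloaded page
doc <- read_xml("<root><item id='1'>text</item></root>")

# Round-trip through saveRDS/readRDS, as in the caching attempt above
tmp <- tempfile(fileext = ".rds")
saveRDS(doc, tmp)
restored <- readRDS(tmp)

# The restored object only holds dead external pointers, so using it fails;
# tryCatch captures the error instead of stopping the script
result <- tryCatch(xml_name(restored), error = function(e) e)
print(result)
```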

Dieter Menne

1 Answer

You can always use as_list and as_xml_document to convert back and forth.

library(digest)
library(xml2)
url = "https://en.wikipedia.org"
cache = digest(url)
if (file.exists(cache)) {
  cat("Reading from cache\n")
  html = as_xml_document(readRDS(cache))
} else {
  cat("Reading from web\n")
  html = read_html(url) 
  saveRDS(as_list(html), file = cache)
}
html

Alternatively, look into read_xml and write_xml.
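That alternative could look roughly like the sketch below, which caches the serialized markup on disk instead of the parsed object. It uses write_html and read_html, the HTML counterparts of write_xml and read_xml in xml2. The helper name read_html_cached and its cache_dir parameter are hypothetical, not part of xml2.

```r
library(digest)
library(xml2)

# Hypothetical helper: read_html() backed by an on-disk cache of the markup.
# cache_dir is an assumed parameter; any writable directory should work.
read_html_cached <- function(url, cache_dir = tempdir()) {
  cache <- file.path(cache_dir, paste0(digest(url), ".html"))
  if (!file.exists(cache)) {
    cat("Reading from web\n")
    # Store the document as text, so no external pointers end up in the file
    write_html(read_html(url), cache)
  } else {
    cat("Reading from cache\n")
  }
  # Re-parse the cached file; this creates fresh, valid pointers every time
  read_html(cache)
}
```

Usage would be html = read_html_cached("https://en.wikipedia.org"). The cached file is re-parsed on every call, so this saves the network request but not the parsing time.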

d125q
  • Thanks for the idea, but I gave up on it and now cache derived quantities instead. as_list(html) is TERRIBLY slow (1 minute for the Wikipedia page), and I only find some of the attributes afterwards; it looks like something is lost in translation. – Dieter Menne Aug 21 '18 at 15:39