I'm using rvest to scrape dates from a set of HTML files I have stored locally. The stored files do not carry their full URL, so while parsing I also loop over them to capture the canonical URL where one is available (most files return a value for the loop below).
library(rvest)
setwd("/directory/Apr")
# Get the list of HTML files
htmlfiles <- list.files(pattern = "\\.html$")
# Return the canonical URL of each page
page <- sapply(htmlfiles, function(file) {
  link <- read_html(file) %>%
    html_node(xpath = ".//link[contains(@rel, 'canonical')]") %>%
    html_attr("href")
  link
})
# Loop through the HTML files for "Last Updated"
last_updated <- sapply(htmlfiles, function(file) {
  datetime <- read_html(file) %>%
    html_node(xpath = ".//meta[contains(@property, 'last_updated')]") %>%
    html_attr("content")
  datetime
})
I've tested this code on a subset of my dataset and it works fine. However, when I run it on the full dataset, I get this error:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "xml_document"
I do not understand what this error means or how to resolve it.
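To narrow it down, I could wrap each read in tryCatch so that the loop keeps going and reports which files fail. This is only a rough sketch and not a fix: it assumes the error comes from individual files rather than from my setup, and `safe_attr` is a hypothetical helper name I made up, not part of rvest.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns the attribute value,
# or NA if reading or parsing the file raises an error. The file name is
# reported so the problematic files can be inspected by hand.
safe_attr <- function(file, xpath, attr) {
  tryCatch({
    read_html(file) %>%
      html_node(xpath = xpath) %>%
      html_attr(attr)
  }, error = function(e) {
    message("Failed on ", file, ": ", conditionMessage(e))
    NA_character_
  })
}

htmlfiles <- list.files(pattern = "\\.html$")
last_updated <- vapply(
  htmlfiles, safe_attr, character(1),
  xpath = ".//meta[contains(@property, 'last_updated')]",
  attr = "content"
)

# Files that either raised an error or had no matching tag
names(last_updated)[is.na(last_updated)]
```

The NA entries mix two cases (an error, or simply no matching meta tag), so the messages printed during the loop are what separate the files that actually trigger the error.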