
I'm using rvest to scrape dates from a set of HTML files I have stored locally. The way they are stored does not give me the full URL path, so in my parsing I have a loop that captures the canonical URL when it is available (most files return a value from the loop below).

library(rvest)  # also re-exports the %>% pipe used below

setwd("/directory/Apr")

# get the list of HTML files (escape the dot and anchor the pattern so only .html files match)
htmlfiles <- list.files(pattern = "\\.html$")

# extract the canonical page URL from each file
page <- sapply(htmlfiles, function(file) {
  link <- read_html(file) %>%
    html_node(xpath = ".//link[contains(@rel, 'canonical')]") %>%
    html_attr("href")
  link
})

# extract the "Last Updated" timestamp from each file
last_updated <- sapply(htmlfiles, function(file) {
  datetime <- read_html(file) %>%
    html_node(xpath = ".//meta[contains(@property, 'last_updated')]") %>%
    html_attr("content")
  datetime
})

I've tested this code on a subset of my dataset and it works fine. However, when I apply it to my full dataset, I get the error:

Error in UseMethod("xml_find_first") : 
  no applicable method for 'xml_find_first' applied to an object of class "xml_document"

I do not understand what this error means or how to resolve it.

  • It sounds like one of the files may be malformed. Maybe add a `print(file)` to the sapply function so it will print the name of the file it gets the error on – MrFlick Jul 10 '21 at 06:12
  • It seems to be erroring on one of the HTML files but I cannot figure out why. From what I can see, it seems to be just like the other files near it in the sequence. With so many of them, is there a way to skip files that error? Would wrapping the loop in a try() help? – MAb2021 Jul 10 '21 at 06:43
  • If you want to skip errors, see: https://stackoverflow.com/questions/2589275/how-to-tell-lapply-to-ignore-an-error-and-process-the-next-thing-in-the-list – MrFlick Jul 10 '21 at 06:48
  • I think I'll need to in this case, there are too many files to go through one by one for errors. Thank you for your help, I really appreciate it! – MAb2021 Jul 10 '21 at 06:55
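Per MrFlick's suggestions in the comments, one way to proceed is to wrap the per-file parsing in tryCatch() so a malformed file is reported by name and skipped instead of aborting the whole run. The sketch below is one possible shape, not tested against the asker's files: scrape_file is a hypothetical helper name, it reuses the same XPath expressions as the question, and it parses each file only once to extract both values.

library(rvest)

# hypothetical helper: parse one file, returning NAs on failure instead of stopping
scrape_file <- function(file) {
  tryCatch({
    doc <- read_html(file)  # parse the file once
    list(
      page = doc %>%
        html_node(xpath = ".//link[contains(@rel, 'canonical')]") %>%
        html_attr("href"),
      last_updated = doc %>%
        html_node(xpath = ".//meta[contains(@property, 'last_updated')]") %>%
        html_attr("content")
    )
  }, error = function(e) {
    # report which file failed, echoing the print(file) debugging idea above
    message("Skipping ", file, ": ", conditionMessage(e))
    list(page = NA_character_, last_updated = NA_character_)
  })
}

results <- lapply(htmlfiles, scrape_file)
page <- sapply(results, `[[`, "page")
last_updated <- sapply(results, `[[`, "last_updated")

The NA placeholders keep page and last_updated aligned with htmlfiles, so the files that failed can be identified afterwards with which(is.na(page)).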

0 Answers