
I'm trying to extract text with `html_nodes` from a set of URLs that I have saved in an object called `url`. I have written a loop that reads and scrapes each URL.

library(rvest)
for (i in url) {
  tex <- read_html(i)
  p_text <- tex %>%
    html_nodes("p") %>%
    html_text()
  a <- p_text
}

Because some of the URLs don't work, the following message appears:

Error in open.connection(x, "rb") : Could not resolve host: app.lo

I want to handle this inside the loop: if a URL doesn't work, treat its text as blank and let the loop continue. This is a real problem because the loop keeps stopping; I tried removing the bad URLs by hand, but I have around 200,000 HTML pages.

CatCaller
  • If you use `lapply` (or `purrr::map` variants) with `tryCatch`/`purrr::possibly`/`purrr::safely`, the results will automatically be stored in a list you can clean up afterwards. That said, 200k URLs is a lot! If you use `Sys.sleep` to wait 10 seconds between calls (a common request in robots.txt so hosts don't inadvertently get DDoSed by scrapers), it would take 23 days even if the code itself were instantaneous. Thus, start by looking for a more direct approach to get your data. – alistaire May 16 '18 at 05:20
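
A minimal sketch of that approach, assuming `url` is a character vector of URLs; it uses `purrr::possibly()` so failures become `NA` instead of errors, and the 10-second pause is only illustrative:

library(rvest)
library(purrr)

# Wrap the scraper so a failing URL yields NA instead of stopping the run
scrape_paragraphs <- possibly(function(u) {
  Sys.sleep(10)                  # be polite between requests
  read_html(u) %>%
    html_nodes("p") %>%
    html_text()
}, otherwise = NA_character_)

# map() collects one result per URL in a list
a <- map(url, scrape_paragraphs)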

2 Answers


This can be achieved with a simple `tryCatch()` for error handling. I have also introduced a list `a` in which to store your outputs (currently you are overwriting your output on each iteration of the loop).

library(rvest)
a <- list()
for (i in seq_along(url)) {
  url_use <- url[[i]]
  # If read_html() (or anything downstream) errors, store NA and move on
  a[[i]] <- tryCatch({
    read_html(url_use) %>%
      html_nodes("p") %>%
      html_text()
  }, error = function(e) NA)
}
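
As a small follow-up sketch (assuming `a` and `url` as above), the failed URLs end up as `NA` entries, so you can locate them afterwards:

failed <- which(is.na(a))  # indices whose scrape raised an error
url[failed]                # the corresponding URLs, e.g. for retrying later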

Let me know if this is not what you had in mind.

Kim
  • will that assignment ` a[[i]] <- ` work from inside the `error` function? – R.S. May 16 '18 at 05:26
  • @R.S. Uhh, nooo... (groans at one's own stupidity). Updated answer. Thanks for pointing it out. – Kim May 16 '18 at 05:41
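
For context on that exchange: `tryCatch()` returns the value of whichever branch ran (the expression on success, the handler's return value on error), so in the updated code the assignment to `a[[i]]` happens outside the handler. A minimal illustration:

result <- tryCatch({
  stop("boom")             # the expression fails...
}, error = function(e) NA) # ...so tryCatch() returns the handler's value instead

result                     # NA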

You should be able to just switch to `html_node` instead of `html_nodes`.

`html_node` returns a missing node if nothing is matched, so `html_text()` on it gives `NA`.

Without any sample URLs, I can't test, however.
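
A minimal illustration of the difference, using an inline HTML snippet with no `<p>` tags:

library(rvest)

page <- read_html("<html><body><div>no paragraphs here</div></body></html>")

page %>% html_nodes("p") %>% html_text()  # character(0): empty result when nothing matches
page %>% html_node("p") %>% html_text()   # NA: html_node() returns a missing node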

See these Q&As for more reference.

MichaelChirico