
I'm trying to extract text with `html_nodes` from a set of URLs that I have saved in an object called `url`. I have written a loop that reads and scrapes each URL.

library(rvest)
for (i in url) {
  tex <- read_html(i)
  p_text <- tex %>%
    html_nodes("p") %>%
    html_text()
  a <- p_text
}

Because some of the URLs don't work, the following message appears:

Error in open.connection(x, "rb") : Could not resolve host: app.lo

I want to handle this inside the loop: if a URL doesn't work, treat its text as blank and let the loop continue. This is a real problem because the loop keeps stopping; I tried removing the bad URLs by hand, but I have around 200,000 HTML pages.

CatCaller
  • If you use `lapply` (or `purrr::map` variants) with `tryCatch`/`purrr::possibly`/`purrr::safely`, the results will automatically be stored in a list you can clean up afterwards. That said, 200k URLs is a lot! If you use `Sys.sleep` to wait 10 seconds between calls (a common request in robots.txt so hosts don't inadvertently get DDoSed by scrapers), it would take 23 days even if the code itself were instantaneous. Thus, start by looking for a more direct approach to get your data. – alistaire May 16 '18 at 05:20
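
A minimal sketch of that approach, assuming `url` is a character vector of URLs; it uses `purrr::possibly()` so failures become `NA` instead of errors, and the 10-second pause is only illustrative:

library(rvest)
library(purrr)

# Wrap the scraper so a failing URL yields NA instead of stopping the run
scrape_paragraphs <- possibly(function(u) {
  Sys.sleep(10)                  # be polite between requests
  read_html(u) %>%
    html_nodes("p") %>%
    html_text()
}, otherwise = NA_character_)

# map() collects one result per URL in a list
a <- map(url, scrape_paragraphs)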

2 Answers


This can be achieved with a simple `tryCatch()` for error handling. I have also introduced a list `a` in which to store your outputs (currently you are overwriting your output on each iteration of the loop).

library(rvest)
a <- list()
for (i in seq_along(url)) {
  url_use <- url[[i]]
  # If read_html() (or anything downstream) errors, store NA and move on
  a[[i]] <- tryCatch({
    read_html(url_use) %>%
      html_nodes("p") %>%
      html_text()
  }, error = function(e) NA)
}
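
As a small follow-up sketch (assuming `a` and `url` as above), the failed URLs end up as `NA` entries, so you can locate them afterwards:

failed <- which(is.na(a))  # indices whose scrape raised an error
url[failed]                # the corresponding URLs, e.g. for retrying later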

Let me know if this is not what you had in mind.

Kim
  • will that assignment ` a[[i]] <- ` work from inside the `error` function? – R.S. May 16 '18 at 05:26
  • @R.S. Uhh, nooo... (groans at one's own stupidity). Updated answer. Thanks for pointing it out. – Kim May 16 '18 at 05:41
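
For context on that exchange: `tryCatch()` returns the value of whichever branch ran (the expression on success, the handler's return value on error), so in the updated code the assignment to `a[[i]]` happens outside the handler. A minimal illustration:

result <- tryCatch({
  stop("boom")             # the expression fails...
}, error = function(e) NA) # ...so tryCatch() returns the handler's value instead

result                     # NA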

You should be able to just switch to `html_node` instead of `html_nodes`.

`html_node` returns a missing node if nothing is matched, so `html_text()` on it gives `NA`.

Without any sample URLs, I can't test, however.
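
A minimal illustration of the difference, using an inline HTML snippet with no `<p>` tags:

library(rvest)

page <- read_html("<html><body><div>no paragraphs here</div></body></html>")

page %>% html_nodes("p") %>% html_text()  # character(0): empty result when nothing matches
page %>% html_node("p") %>% html_text()   # NA: html_node() returns a missing node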

See these Q&As for more reference.

MichaelChirico