
I have created an rvest scraper that scrapes a job listing site. Unfortunately, it takes forever to loop through just 100 pages. Is there any quick fix to make this faster? Below is the basic structure I'm using:

library(rvest)
library(stringi)

for (i in beginning:end) {
  # Download and parse the page once; every field below is extracted
  # from this single parsed document
  page <- read_html(paste0("https://www.jobsite.com", links[[1]][i]))
  address[[i]] <- html_nodes(x = page, css = selector_name) %>%
    html_text()
  # The third text node under selector_name holds the employer
  employer[[i]] <- address[[i]][3]
  rating[[i]] <- html_nodes(x = page, css = selector_rating) %>%
    html_attr("data-jobsite") %>%
    as.numeric()
  # Rescale the site's 0-6 rating to 0-10 and round; default to 1
  # when the page has no rating element
  rating[[i]] <- round(rating[[i]] * (10 / 6))
  rating[[i]] <- ifelse(length(rating[[i]]) == 0, 1, rating[[i]])
  title[[i]] <- html_nodes(x = page, css = ".xsmall-10") %>%
    html_text()
  # Strip the indentation whitespace and line breaks left by html_text()
  title[[i]] <- stri_replace_all_regex(title[[i]], "\\r\\n| {8,}", "")
  dd[[i]] <- html_nodes(x = page, css = ".item-price") %>%
    html_text()
}
  • Don't use a `for` loop, use `lapply` (see the first sketch after these comments). If you do use a `for` loop, preallocate a vector of the appropriate size (which is hard in this case). And [don't use regex to parse HTML](http://stackoverflow.com/a/1732454/4497050). – alistaire Jan 20 '17 at 05:27
  • It is slow because you have to download each page, one at a time. If you are comfortable with Python, I'd suggest aiohttp and asyncio; they are very fast because many pages can be requested simultaneously: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html. R was designed to be single-threaded; parallelization is possible (see the `future` package, sketched after these comments) but more involved. – Jean Jan 20 '17 at 07:53
  • How long is forever? – Ansjovis86 Jan 20 '17 at 10:32
  • You could try this (a concurrent-download sketch follows these comments): https://github.com/jeroenooms/curl/blob/master/examples/sitemap.R – Rentrop Jan 20 '17 at 11:56
  • Thanks everyone, very helpful. A friend of mine has also suggested using an HTML DOM parser instead? – GetReal Jan 23 '17 at 09:16
  • There is now a multi-threaded scraper for R. http://www.sciencedirect.com/science/article/pii/S2352711017300110 – CoderGuy123 Jun 30 '17 at 06:00
  • I reimplemented my rvest code in Python with aiohttp following the tutorial suggested by @Jean, but ended up just being blocked by the site. rvest being very slow and working sequentially rather than in parallel actually seems to be an advantage, because it comes closer to human behavior. There's no point in parallelizing and then adding delays again to avoid being blocked. – sgrubsmyon Apr 06 '20 at 09:13
  • It is highly recommended to build on rvest's "slowness" even further: consider the R package `polite` (https://github.com/dmi3kno/polite) to follow "responsible web etiquette" while scraping (a minimal example closes out the sketches below). – sgrubsmyon May 08 '20 at 17:48
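
A minimal sketch of the `lapply` approach from alistaire's comment, assuming the same placeholder objects as the question (`links`, `beginning`, `end`, `selector_name`): parse each page once inside a helper, pull every field from that one parsed document, and bind the per-page results at the end instead of growing vectors in a loop.

library(rvest)

scrape_page <- function(link) {
  page  <- read_html(paste0("https://www.jobsite.com", link))
  texts <- html_text(html_nodes(page, selector_name))
  data.frame(
    address  = texts[1],
    employer = texts[3],
    title    = trimws(html_text(html_nodes(page, ".xsmall-10"))[1]),
    stringsAsFactors = FALSE
  )
}

# One list element per page, then one data frame for all pages
results <- lapply(links[[1]][beginning:end], scrape_page)
results <- do.call(rbind, results)

This mostly tidies the code rather than speeding it up: the pages are still downloaded one at a time, and the download is the real bottleneck.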
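
A hedged sketch of the `future`-based parallelization Jean mentions, reusing the hypothetical `scrape_page()` helper above via the `future.apply` package. As the later comments warn, hitting a site from several workers at once makes it more likely you will be blocked.

library(future.apply)

plan(multisession, workers = 4)  # four parallel R sessions
results <- future_lapply(links[[1]][beginning:end], scrape_page)
results <- do.call(rbind, results)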
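
The sitemap.R example Rentrop links to is built on the `curl` package's multi handle; below is a rough sketch of the same idea under the question's placeholders. All requests are queued first and performed concurrently, and only afterwards parsed with rvest, since the downloading rather than the parsing is where the time goes.

library(curl)
library(rvest)

urls  <- paste0("https://www.jobsite.com", links[[1]][beginning:end])
pages <- vector("list", length(urls))

for (j in seq_along(urls)) {
  local({
    k <- j  # freeze the index for the callback's closure
    curl_fetch_multi(
      urls[k],
      done = function(res) pages[[k]] <<- read_html(rawToChar(res$content))
    )
  })
}
multi_run()  # performs all queued downloads concurrently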
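
Finally, a minimal sketch of the `polite` workflow from the last comment, under the same placeholder links: `bow()` once per host (it reads robots.txt and honors a crawl delay), then `nod()` to each path before `scrape()` returns the parsed page. This is deliberately slower, not faster, in exchange for not getting blocked.

library(polite)

session <- bow("https://www.jobsite.com")
pages <- lapply(links[[1]][beginning:end], function(link) {
  scrape(nod(session, path = link))  # returns a parsed xml_document
})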
