
I have created an rvest scraper that scrapes a job listing site. Unfortunately, it takes forever to loop through just 100 pages. Is there any quick fix to make this faster? Below is the basic structure I'm using:

library(rvest)
library(stringi)

for (i in beginning:end) {
  # Download and parse the page once; every field below is extracted
  # from this single parsed document
  page <- read_html(paste0("https://www.jobsite.com", links[[1]][i]))
  address[[i]] <- html_nodes(x = page, css = selector_name) %>%
    html_text()
  # The third text node under selector_name holds the employer
  employer[[i]] <- address[[i]][3]
  rating[[i]] <- html_nodes(x = page, css = selector_rating) %>%
    html_attr("data-jobsite") %>%
    as.numeric()
  # Rescale the site's 0-6 rating to 0-10 and round; default to 1
  # when the page has no rating element
  rating[[i]] <- round(rating[[i]] * (10 / 6))
  rating[[i]] <- ifelse(length(rating[[i]]) == 0, 1, rating[[i]])
  title[[i]] <- html_nodes(x = page, css = ".xsmall-10") %>%
    html_text()
  # Strip the indentation whitespace and line breaks left by html_text()
  title[[i]] <- stri_replace_all_regex(title[[i]], "\\r\\n| {8,}", "")
  dd[[i]] <- html_nodes(x = page, css = ".item-price") %>%
    html_text()
}
  • Don't use a `for` loop, use `lapply` (see the first sketch after these comments). If you do use a `for` loop, preallocate a vector of the appropriate size (which is hard in this case). And [don't use regex to parse HTML](http://stackoverflow.com/a/1732454/4497050). – alistaire Jan 20 '17 at 05:27
  • It is slow because you have to download each page, one at a time. If you are comfortable with Python, I'd suggest aiohttp and asyncio; they are very fast because many pages can be requested simultaneously: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html. R was designed to be single-threaded; parallelization is possible (see the `future` package, sketched after these comments) but more involved. – Jean Jan 20 '17 at 07:53
  • How long is forever? – Ansjovis86 Jan 20 '17 at 10:32
  • You could try this (a concurrent-download sketch follows these comments): https://github.com/jeroenooms/curl/blob/master/examples/sitemap.R – Rentrop Jan 20 '17 at 11:56
  • Thanks everyone, very helpful. A friend of mine has also suggested using an HTML DOM parser instead? – GetReal Jan 23 '17 at 09:16
  • There is now a multi-threaded scraper for R. http://www.sciencedirect.com/science/article/pii/S2352711017300110 – CoderGuy123 Jun 30 '17 at 06:00
  • I reimplemented my rvest code in Python with aiohttp following the tutorial suggested by @Jean, but ended up just being blocked by the site. rvest being very slow and working sequentially rather than in parallel actually seems to be an advantage, because it comes closer to human behavior. There's no point in parallelizing and then adding delays again to avoid being blocked. – sgrubsmyon Apr 06 '20 at 09:13
  • It is highly recommended to build on rvest's "slowness" even further: consider the R package `polite` (https://github.com/dmi3kno/polite) to follow "responsible web etiquette" while scraping (a minimal example closes out the sketches below). – sgrubsmyon May 08 '20 at 17:48
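
A minimal sketch of the `lapply` approach from alistaire's comment, assuming the same placeholder objects as the question (`links`, `beginning`, `end`, `selector_name`): parse each page once inside a helper, pull every field from that one parsed document, and bind the per-page results at the end instead of growing vectors in a loop.

library(rvest)

scrape_page <- function(link) {
  page  <- read_html(paste0("https://www.jobsite.com", link))
  texts <- html_text(html_nodes(page, selector_name))
  data.frame(
    address  = texts[1],
    employer = texts[3],
    title    = trimws(html_text(html_nodes(page, ".xsmall-10"))[1]),
    stringsAsFactors = FALSE
  )
}

# One list element per page, then one data frame for all pages
results <- lapply(links[[1]][beginning:end], scrape_page)
results <- do.call(rbind, results)

This mostly tidies the code rather than speeding it up: the pages are still downloaded one at a time, and the download is the real bottleneck.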
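
A hedged sketch of the `future`-based parallelization Jean mentions, reusing the hypothetical `scrape_page()` helper above via the `future.apply` package. As the later comments warn, hitting a site from several workers at once makes it more likely you will be blocked.

library(future.apply)

plan(multisession, workers = 4)  # four parallel R sessions
results <- future_lapply(links[[1]][beginning:end], scrape_page)
results <- do.call(rbind, results)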
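
The sitemap.R example Rentrop links to is built on the `curl` package's multi handle; below is a rough sketch of the same idea under the question's placeholders. All requests are queued first and performed concurrently, and only afterwards parsed with rvest, since the downloading rather than the parsing is where the time goes.

library(curl)
library(rvest)

urls  <- paste0("https://www.jobsite.com", links[[1]][beginning:end])
pages <- vector("list", length(urls))

for (j in seq_along(urls)) {
  local({
    k <- j  # freeze the index for the callback's closure
    curl_fetch_multi(
      urls[k],
      done = function(res) pages[[k]] <<- read_html(rawToChar(res$content))
    )
  })
}
multi_run()  # performs all queued downloads concurrently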
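
Finally, a minimal sketch of the `polite` workflow from the last comment, under the same placeholder links: `bow()` once per host (it reads robots.txt and honors a crawl delay), then `nod()` to each path before `scrape()` returns the parsed page. This is deliberately slower, not faster, in exchange for not getting blocked.

library(polite)

session <- bow("https://www.jobsite.com")
pages <- lapply(links[[1]][beginning:end], function(link) {
  scrape(nod(session, path = link))  # returns a parsed xml_document
})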
