
I am using rvest to webscrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. A row of the frame contains two related links. I want to generate a 28,625 by 4 data frame Final with information scraped from the links. One piece of information is from the second link in a row, and the other three are from the first link. The xpaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:

library(rvest)

# preallocate a flat character vector: four scraped fields per row of urls
data <- rep("", 4 * 28625)

k <- 1

for (i in 1:28625) {

  # the second link in each row holds a table; keep row 4, column 3
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)

  data[k] <- name[4, 3]

  # the first link holds the remaining three fields, one per xpath in xpaths
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()

  k <- k + 4

}

# reshape into 4 values per record, then transpose to get 28,625 rows by 4 columns
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))

It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place, to keep the code from making unnecessary copies of things, etc.

Why is this so memory intensive? Is there anything I can do to resolve it?

jcz
  • It's probably a memory leak in the underlying XML or RCurl code. Both can be replaced now, but I haven't had a chance yet. – hadley Aug 14 '15 at 03:22
  • This could be the same underlying issue: http://stackoverflow.com/questions/23696391/memory-leak-when-using-package-xml-on-windows. I am performing a similar sort of scraping as you with rvest and want to run my scrape function 80k times, but I have to do it in chunks of 3k, then reboot R, due to this memory leak. – Sam Firke Aug 19 '15 at 02:39
  • @SamFirke I'm doing exactly the same thing. Can you automate the R reboot, or do you just have to babysit it? Thanks for referring me to that question. Resolving these issues is probably above my paygrade, but it's good to know folks are thinking about them. – jcz Aug 19 '15 at 15:32
  • I just babysit it, unfortunately. I try to set my chunks to the maximum size that will not crash RStudio. Writing a GitHub commit to fix the memory leak in an underlying package is above my ability, too. Seems like a bug only affecting Windows users? Possible workarounds are in answers to http://stackoverflow.com/questions/23696391/memory-leak-when-using-package-xml-on-windows/ and http://stackoverflow.com/questions/24497562/workaround-to-r-memory-leak-with-xml-package but nothing simple. One person just gave up and used Python... – Sam Firke Aug 19 '15 at 16:13 (a sketch of this chunk-and-restart workaround follows these comments)
  • @SamFirke Alright, we're doing the same thing, then. I'm running the code on Windows and Mac OS X, and I have the same problem. I'm probably going to have more, larger webscraping tasks in the future, so learning Beautiful Soup will be a good long-term investment. – jcz Aug 19 '15 at 17:34
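
A minimal sketch of that chunk-and-restart workaround, assuming the scraping loop from the question has been moved into a separate worker script (scrape_chunk.R is a hypothetical name) that takes a start row and an end row on the command line, scrapes only those rows of urls, and saves its piece of the result with saveRDS():

# driver.R: run the scrape in fresh R processes so that any memory leaked
# by the underlying XML/RCurl code is returned to the OS between chunks
chunk_size <- 3000
starts <- seq(1, 28625, by = chunk_size)

for (s in starts) {
  e <- min(s + chunk_size - 1, 28625)
  # launch a new Rscript process for rows s through e
  system2("Rscript", args = c("scrape_chunk.R", s, e))
}

# reassemble the pieces, assuming the worker saved each one as chunk_<start>.rds
files <- sprintf("chunk_%d.rds", starts)
Final <- do.call(rbind, lapply(files, readRDS))

Inside scrape_chunk.R, the row range would come from commandArgs(trailingOnly = TRUE).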

1 Answer


rvest has since been updated to resolve this issue; version 0.3.0 replaced the underlying XML and RCurl packages with xml2 and httr. See here:

http://www.r-bloggers.com/rvest-0-3-0/
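
For reference, a rough sketch of the loop from the question after updating to rvest 0.3.0 or later, where html() has been replaced by read_html(); urls and xpaths are as defined in the question, and here each of the three xpaths is evaluated individually against the parsed page:

library(rvest)

data <- rep("", 4 * 28625)
k <- 1

for (i in 1:28625) {
  # read_html() is the xml2-backed replacement for the old html()
  tab <- read_html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- tab[[3]][4]  # row 4, column 3, as in the question

  # parse the first link once, then pull one value per xpath
  page <- read_html(urls[i, 1])
  data[k + 1:3] <- vapply(
    xpaths,
    function(xp) page %>% html_node(xpath = xp) %>% html_text(),
    character(1)
  )

  k <- k + 4
}

dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))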

jcz