
I'm scraping a website and am calling my scraping function from a for-loop. Around iteration 4,000 of the loop, my computer warned me that RStudio was using too much memory. But after breaking the loop with the escape key I don't see any large objects in my R environment.

I tried the tips in these two posts, but they don't reveal a cause. When I call mem_used() from the pryr package I get:

2.3 GB

That aligns with what the Windows Task Manager reported initially: 2.3 GB, dropping to 1.7 GB ten minutes after I terminated the loop and to 1.2 GB twenty minutes after the loop ended. mem_used() continues to report 2.3 GB.
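
For reference, this is all I'm running to check memory (with pryr loaded); gc() forces a garbage collection and reports R's own Ncells/Vcells totals, and mem_used() reports the overall figure:

library(pryr)

gc()        # triggers a garbage collection and prints R's Ncells/Vcells usage
mem_used()  # still reports ~2.3 GB in my session, even after stopping the loop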

But my R objects are small, according to the lsos() function in the first post linked above:

> lsos()
                       Type     Size  Rows Columns
all_raw              tbl_df 17390736 89485      12
all_clean            tbl_df 14693336 89485      15
all_no_PAVs          tbl_df 14180576 86050      15
all_no_dupe_names    tbl_df 13346256 79646      15
sample_in            tbl_df  1917128  9240      15
testdat              tbl_df  1188152  5402      15
username_res         tbl_df   792936  4091      14
getUserName        function   151992    NA      NA
dupe_names           tbl_df   132040  2802       3
time_per_iteration  numeric    65408  4073      NA

That says my largest object is 17 MB, nowhere near 2.3 GB. How can I find what is actually tying up the memory and release it? Is something in the loop gradually holding on to memory?

Here is a reproducible test example, scraping IMDB.com:

library(rvest) # rvest 0.2.0 is needed to reproduce the leak; it was fixed in 0.3.0
library(stringr)
library(dplyr)

# 70 search terms: the US city names from the built-in precip dataset
search_list <- make.names(names(precip))


scrape_top_titles <- function(search_string) {
  urlToGet <- paste0("http://www.imdb.com/find?q=", search_string)
  print(urlToGet)
  page <- html(urlToGet)  # rvest 0.2.0 parses the page via the XML package

  # Text of the first three search results on the page
  top_3_hits <- page %>%
    html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
    html_text %>%
    str_trim %>%
    .[1:3]

  result <- list(search_term = search_string,
                 hit_1 = top_3_hits[1],
                 hit_2 = top_3_hits[2],
                 hit_3 = top_3_hits[3],
                 page_length = nchar(page %>% html_text))

  result
}

# These 70 scrapes will start filling memory
scrapes <- bind_rows(
  lapply(search_list, function(x) {data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)})
)

# For more dramatic memory filling, scrape 770 pages instead
longer_list <- as.vector(outer(search_list, names(mtcars), paste, sep="_"))
long_scrapes <- bind_rows(
  lapply(longer_list, function(x) {data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)})
) 
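
To watch the growth concretely, I log what mem_used() reports after each scrape (a diagnostic sketch only; mem_log and the plot aren't part of the real scraping code):

# Diagnostic sketch: record R's reported memory after every scrape
library(pryr)

mem_log <- numeric(length(search_list))
for (i in seq_along(search_list)) {
  scrape_top_titles(search_list[i])     # result discarded; only memory use matters here
  mem_log[i] <- as.numeric(mem_used())  # bytes R reports as used after this iteration
}
plot(mem_log / 2^20, type = "l", xlab = "iteration", ylab = "MB reported by mem_used()")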

Update: it looks like this was a memory leak in the XML package, which rvest 0.2.0 used for parsing; it is similar to what is described in this question and what manifests in this other question. The 0.3.0 release of rvest uses the xml2 package instead, which resolved the leak, so the code above no longer reproduces the problem unless the old rvest 0.2.0 is installed.
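
If upgrading isn't an option, my guess (untested, so treat it as a sketch) is that explicitly freeing the parsed document with XML::free() inside the scraping function would keep the loop's memory flat, since no visible R object is holding onto the parsed pages. The variant below is just scrape_top_titles with that one extra step, renamed scrape_top_titles_free to keep it distinct:

# Possible workaround for rvest 0.2.0 (a sketch, untested): free the C-level
# libxml2 document explicitly once the needed values have been extracted,
# rather than waiting for R's garbage collector.
scrape_top_titles_free <- function(search_string) {
  urlToGet <- paste0("http://www.imdb.com/find?q=", search_string)
  page <- html(urlToGet)

  top_3_hits <- page %>%
    html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
    html_text %>%
    str_trim %>%
    .[1:3]

  page_length <- nchar(page %>% html_text)

  XML::free(page)  # release the parsed document's memory right away
  rm(page)

  list(search_term = search_string,
       hit_1 = top_3_hits[1],
       hit_2 = top_3_hits[2],
       hit_3 = top_3_hits[3],
       page_length = page_length)
}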

I'm still looking for an answer that would fully describe what was going on here: can anyone explain the "memory leak"? The problem is fixed but I'm curious about what was happening.

  • I've always heard that for-loops are very inefficient, but I don't have the first clue how that actually works, so I'm looking forward to seeing an answer here. – ila Jul 13 '15 at 14:25
  • I'd check the environment of `getUserName`, for a start. – Hong Ooi Jul 13 '15 at 14:25
  • @HongOoi how should I do that? I just cleared the whole environment, then ran `gc()` for garbage cleaning, now `ls()` returns `character(0)`, but `mem_used()` still shows 2.26 GB. – Sam Firke Jul 13 '15 at 17:46
  • Growing objects in R is very inefficient. Try pre-allocating the size of `username_res` and `time_per_iteration`. Also writing `seq(1:nrow(testdat))` is redundant. Use either `seq(nrow(testdat))` or `1:nrow(testdat)`. You may be able to eliminate the data.frame assignment altogether. Create the vectors in the loop and bind them together after. – Pierre L Jul 14 '15 at 14:52
  • @PierreLafortune thanks! I just re-read about growing objects in the [R inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). Based on the 2nd answer to http://stackoverflow.com/questions/14580233/why-does-gc-not-free-memory, I think growing objects here is not just slow, it's also fragmenting memory. As the data frames grow bigger, the available holes get used up. Though I don't understand why R doesn't garbage clean after each iteration of the loop, since I just have one pointer for `username_res` that moves around - it could always clean the previous data.frame. – Sam Firke Jul 16 '15 at 16:24
  • @PierreLafortune I liked the binding rows because if I encountered an error, I would retain the results gathered already. Vs. if I create the thousands of vectors in the loop and then bind them, I lose them all if an error is thrown. But I am also using `tryCatch` in `getUserName()` so maybe that precaution is redundant. Anyhow, that is a separate problem from this memory issue. – Sam Firke Jul 16 '15 at 16:27
  • Please provide a full, minimal reproducible example so that we can test. Have you tried gc()? – Karl Forner Oct 14 '15 at 18:05
  • @KarlForner I've added a fully reproducible example and more context; the latest release of rvest (0.3.0) fixed the problem by not using the xml package anymore, so the above example will only produce the memory leak if you install rvest 0.2.0. That said, I'm still hoping someone can explain the behind-the-scenes of what happened here, in case it applies elsewhere. – Sam Firke Oct 14 '15 at 20:28
  • Your example does not work with rvest 0.2.0: Error in scrape_top_titles(x) : could not find function "read_html" – Karl Forner Oct 14 '15 at 22:09
