I'm scraping a website and am calling my scraping function from a for-loop. Around iteration 4,000 of the loop, my computer warned me that RStudio was using too much memory. But after breaking the loop with the escape key I don't see any large objects in my R environment.
I tried the tips on these two posts but they don't reveal a cause. When I call mem_used() from the pryr package I get:
2.3 GB
That aligns with what Windows Task Manager reported initially: it said 2.3 GB, then dropped to 1.7 GB ten minutes after I terminated the loop and to 1.2 GB twenty minutes after. mem_used() continues to report 2.3 GB.
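For reference, the in-R check boils down to something like the following sketch (the explicit gc() call is an extra step I threw in just to rule out a pending garbage collection; it is not part of what I described above):

library(pryr)

mem_used()   # total memory used by R objects, as reported by pryr
gc()         # extra check: force a garbage collection in case anything is pending
mem_used()   # still reports ~2.3 GB in my session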
But my R objects are small, according to the lsos() function in the first post linked above:
> lsos()
                       Type     Size   Rows Columns
all_raw              tbl_df 17390736  89485      12
all_clean            tbl_df 14693336  89485      15
all_no_PAVs          tbl_df 14180576  86050      15
all_no_dupe_names    tbl_df 13346256  79646      15
sample_in            tbl_df  1917128   9240      15
testdat              tbl_df  1188152   5402      15
username_res         tbl_df   792936   4091      14
getUserName        function   151992     NA      NA
dupe_names           tbl_df   132040   2802       3
time_per_iteration  numeric    65408   4073      NA
That says my largest object is 17 MB, nowhere near 2.3 GB. How can I track down what is using all that memory and fix it? Is there something in the loop that is gradually tying up memory?
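One diagnostic I have been sketching is to log mem_used() every so often inside the loop and see whether it climbs steadily even though no large objects accumulate. This is a rough sketch, not my actual loop: scrape_one_user() and user_list stand in for my real scraping function and input vector, and iter_mem is just a name I made up for the log.

library(pryr)

# placeholders: scrape_one_user() and user_list represent my real loop
iter_mem <- numeric(0)
for (i in seq_along(user_list)) {
  res <- scrape_one_user(user_list[i])
  if (i %% 100 == 0) {                               # sample memory every 100 iterations
    iter_mem <- c(iter_mem, as.numeric(mem_used()))
  }
}
plot(iter_mem, type = "l")   # a steady climb would point at something inside the loop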
Here is a reproducible test example, scraping IMDB.com:
library(rvest) # rvest 0.2.0 needed to produce the error; it was fixed in 0.3.0
library(stringr)
library(dplyr)

# 70 city names from the built-in precip data set, used as search terms
search_list <- make.names(names(precip))

scrape_top_titles <- function(search_string) {
  urlToGet <- paste0("http://www.imdb.com/find?q=", search_string)
  print(urlToGet)
  page <- html(urlToGet)  # in rvest 0.2.0, html() parses the page via the XML package
  top_3_hits <- page %>%
    html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
    html_text %>%
    str_trim %>%
    .[1:3]
  result <- list(search_term = search_string,
                 hit_1 = top_3_hits[1],
                 hit_2 = top_3_hits[2],
                 hit_3 = top_3_hits[3],
                 page_length = nchar(page %>% html_text))
  result
}

# These 70 scrapes will start filling memory
scrapes <- bind_rows(
  lapply(search_list, function(x) {data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)})
)

# For more dramatic memory filling, scrape 770 pages instead
longer_list <- as.vector(outer(search_list, names(mtcars), paste, sep = "_"))
long_scrapes <- bind_rows(
  lapply(longer_list, function(x) {data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)})
)
Update: it looks like this was a memory leak in the XML package that rvest called, similar to what is described in this question and what manifests in this other question. The 0.3.0 release of rvest calls the xml2 package instead, which resolves the leak, so the code above no longer reproduces the problem unless an old version of rvest is used.
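For comparison, here is roughly what the same scrape looks like against rvest 0.3.0+, where read_html() (backed by xml2) replaces html(). This is a sketch based on the updated API, and scrape_top_titles_xml2 is just a name I picked for the variant:

library(rvest)    # 0.3.0 or later
library(stringr)

scrape_top_titles_xml2 <- function(search_string) {
  # read_html() comes from the xml2-backed parser used by rvest 0.3.0+
  page <- read_html(paste0("http://www.imdb.com/find?q=", search_string))
  hits <- page %>%
    html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
    html_text() %>%
    str_trim()
  list(search_term = search_string,
       hit_1 = hits[1], hit_2 = hits[2], hit_3 = hits[3],
       page_length = nchar(html_text(page)))
}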
I'm still looking for an answer that would fully describe what was going on here: can anyone explain the "memory leak"? The problem is fixed but I'm curious about what was happening.