I am using rvest to web scrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. Each row of the frame contains two related links. I want to generate a 28,625 by 4 data frame, Final, with information scraped from the links. One piece of information comes from the second link in a row, and the other three come from the first link. The xpaths to those three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:
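library(rvest)

# Preallocate one flat character vector: 4 values per row of urls
data <- rep("", 4 * 28625)
k <- 1
for (i in 1:28625) {
  # Second link: one value from the table inside #seriesDiv
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- name[4, 3]
  # First link: the three values located by xpaths
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()
  k <- k + 4
}
# Reshape to 28,625 rows by 4 columns
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))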
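To make the reshaping step at the end of that code concrete, here is a minimal sketch of the same fill-then-reshape pattern on toy data. The strings and the value n = 3 are made up for illustration and stand in for the scraped values; nothing here touches the network.

```r
# Toy stand-in for the scraped values: 3 rows, 4 fields per row,
# stored row after row in one flat character vector.
n <- 3
data <- rep("", 4 * n)
k <- 1
for (i in 1:n) {
  data[k] <- paste0("name", i)                  # field taken from the second link
  data[k + 1:3] <- paste0("row", i, "_x", 1:3)  # three fields taken from the first link
  k <- k + 4
}

# R fills matrices column by column, so each group of 4 values becomes one column...
dim(data) <- c(4, n)
# ...and transposing gives one row per pair of links.
Final <- as.data.frame(t(data), stringsAsFactors = FALSE)
print(Final)
```

Note that `k + 1:3` evaluates to `k + 1`, `k + 2`, `k + 3`, because `:` binds more tightly than `+` in R.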
It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place so that the code doesn't make unnecessary copies.
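For reference, the growth can also be logged from inside R rather than read off the task manager. The following is only a sketch: process_row is a hypothetical stand-in for one scraping iteration, and mem_used_mb is a helper name introduced here, built on base R's gc(), whose second column reports memory in use in Mb.

```r
# Total memory currently used by R objects, in Mb (column 2 of gc()'s matrix).
mem_used_mb <- function() sum(gc()[, 2])

# Hypothetical stand-in for the body of one scraping iteration.
process_row <- function(i) {
  paste0("value", i)
}

n <- 50
mem_log <- numeric(n)     # memory in use after each iteration
results <- character(n)   # preallocated output, as in the real loop
for (i in seq_len(n)) {
  results[i] <- process_row(i)
  mem_log[i] <- mem_used_mb()
}

# Growth in R-visible memory between the first and last iteration.
growth <- mem_log[n] - mem_log[1]
print(growth)
```

gc() only accounts for memory on R's own heap. If the task manager figure climbs much faster than mem_log does, that would suggest the memory is being held outside R's heap, for example by compiled code that parses the HTML, though this sketch alone cannot confirm that.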
Why is this so memory intensive? Is there anything I can do to resolve it?