
I am trying to quit and restart R from within R. The reason for this is that my job takes up a lot of memory, and none of the common options for cleaning R's workspace reclaim the RAM taken up by R. gc(), closeAllConnections(), and rm(list = ls(all = TRUE)) clear the workspace, but when I examine the processes in the Windows Task Manager, R's RAM usage remains the same. The memory is only reclaimed when the R session is restarted.

I have tried the suggestion from this post:

Quit and restart a clean R session from within R?

but it doesn't work on my machine. It closes R, but doesn't open it again. I am running R x64 3.0.2 through RGui (64-bit) on Windows 7. Perhaps it is just a simple tweak of the first line in the above post:

makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)

but I am unsure how it needs to be changed.
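One possibility (an untested sketch): the `shell("Rgui")` call probably fails because `Rgui.exe` is not on the `PATH` in a default Windows install. Spawning the new session asynchronously with the full path to the executable (the path below is an assumption; adjust it to your installation) and only then quitting might work:

```r
## Untested sketch. Assumes Rgui.exe lives at the path below; adjust to
## your installation. wait = FALSE launches the new session asynchronously
## so that q("no") can still close the current one afterwards.
makeActiveBinding("refresh", function() {
  system('"C:/Program Files/R/R-3.0.2/bin/x64/Rgui.exe"', wait = FALSE)
  q("no")
}, .GlobalEnv)
```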

Here is the code. It is not fully reproducible, because it needs a large set of files that are read in and scraped. What eats memory is scrape.func(); everything else is pretty small. In the code, I apply the scrape function to all files in one folder. Eventually, I would like to apply it to a set of folders, each with a large number of files (~12,000 per folder; 50+ folders). Doing so at present is impossible, since R runs out of memory pretty quickly.

library(XML)
library(R.utils)

## define scraper function
scrape.func <- function(file.name){
  require(XML)

  ## read in (zipped) html file
  txt <- readLines(gunzip(file.name))

  ## parse html
  doc <- htmlTreeParse(txt,  useInternalNodes = TRUE)

  ## extract information
  top.data <- xpathSApply(doc, "//td[@valign='top']", xmlValue)
  id <- top.data[which(top.data=="I.D.:") + 1]
  pub.date <- top.data[which(top.data=="Data publicarii:") + 1]
  doc.type <- top.data[which(top.data=="Tipul documentului:") + 1]

  ## tie into dataframe
  df <- data.frame(
    id, pub.date, doc.type, stringsAsFactors=F)
  return(df)
  # clean up
  closeAllConnections()
  rm(txt)
  rm(top.data)
  rm(doc)
  gc()
}

## where to store the scraped data
file.create("/extract.top.data.2008.1.csv")

## extract the list of files from the target folder
write(list.files(path = "/2008/01"), 
      file = "/list.files.2008.1.txt")

## count the number of files
length.list <- length(readLines("/list.files.2008.1.txt"))
length.list <- length.list - 1

## read in filename by filename and scrape
for (i in 0:length.list){
  ## read in line by line
  line <- scan("/list.files.2008.1.txt", '', 
               skip = i, nlines = 1, sep = '\n', quiet = TRUE)
  ## catch the full path 
  filename <- paste0("/2008/01/", as.character(line))
  ## scrape
  data <- scrape.func(filename)
  ## append output to results file
  write.table(data, file = "/extract.top.data.2008.1.csv", 
              append = TRUE, sep = ",", col.names = FALSE)
  ## rezip the html
  filename2 <- sub(".gz","",filename)
  gzip(filename2)
}
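As an alternative to restarting the interactive session, the per-folder work could be pushed into short-lived child R processes via `Rscript` (which ships with R): each child exits when its folder is done, so the OS reclaims all of its memory. This is only a sketch; `scrape_folder.R` is a hypothetical helper script that would wrap the per-file loop above, taking a folder path as its command-line argument.

```r
## Sketch: run each folder in a fresh R process so memory is returned to
## the OS when the process exits. "scrape_folder.R" is a hypothetical
## helper that wraps the per-file loop above and reads the folder path
## from its first command-line argument.
scrape_all_folders <- function(folders, script = "scrape_folder.R") {
  for (f in folders) {
    status <- system2("Rscript", args = c(script, f))
    if (status != 0) warning("scraping failed for folder: ", f)
  }
}

## e.g. for the 2008 folders:
## scrape_all_folders(sprintf("/2008/%02d", 1:12))
```

Inside `scrape_folder.R`, the folder path would be read with `commandArgs(trailingOnly = TRUE)`.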

Many thanks in advance, Marko

  • Perhaps [FAQ 7.42 Why is R apparently not releasing memory?](http://cran.r-project.org/doc/manuals/R-FAQ.html#Why-is-R-apparently-not-releasing-memory_003f) – Joshua Ulrich Nov 08 '13 at 16:26
    Those calls to `rm` in the end of your function are never evaluated since they come after the `return` statement, which ends the function. Furthermore, they are not necessary, since these objects only exist in the function environment. – Roland Nov 08 '13 at 16:26
  • @Roland: The `closeAllConnections` call not being evaluated could be more problematic. The `rm` calls are unnecessary, since those objects will be available for `gc` once the function returns. – Joshua Ulrich Nov 08 '13 at 16:28
  • @JoshuaUlrich: Thanks Joshua for the link; I realize now this is the OS's doing. But it still seems to me that the only solution is to restart R, since that makes the OS release the memory. And I'm still not sure how to do that from inside an R script. - Marko – user2967098 Nov 11 '13 at 16:50
  • @Roland: Roland, thanks for the correction. However, moving `rm` calls inside the function does not reduce the memory use noticeably. - Marko – user2967098 Nov 11 '13 at 16:54

1 Answer


I also did some web scraping and ran into exactly the same problem, and it drove me crazy. Although I'm running a modern OS (Windows 10), the memory is still not released from time to time. After having a look at the R FAQ, I went for CleanMem, where you can set an automated memory cleaner to run every 5 minutes or so. Be sure to use

rm(list = ls())
gc()
closeAllConnections()

beforehand, so that R frees up what it can. Then use CleanMem so that the OS will notice there's free memory.

NMM