
In R, I am trying to combine and convert several sets of time series data from http://www.truefx.com/?page=downloads into an xts object; however, the files are large and there are many of them, which is causing issues on my laptop. They are stored as csv files that have been compressed into zip files.

Downloading and unzipping them is easy enough (although it takes up a lot of hard drive space).
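
For reference, that step is roughly the following (the URL and file name here are only illustrative; substitute the actual monthly links from truefx.com):

# illustrative URL/file name -- adjust to the real TrueFX download links
url     <- "http://www.truefx.com/dev/data/2013/JANUARY-2013/EURUSD-2013-01.zip"
zipfile <- "EURUSD-2013-01.zip"
download.file(url, zipfile, mode = "wb")
unzip(zipfile, exdir = "csv")    # extracts the month's csv into ./csv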

Loading the 350MB+ file for one month's worth of data into R is reasonably straightforward with the new fread() function in the data.table package.
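
The load itself is just something like this (the file name is illustrative):

library(data.table)

# fread() (from data.table v1.8.7 on R-Forge) reads the ~350MB file quickly;
# the TrueFX csv has no header row, so the columns arrive as V1..V4
dt <- fread("csv/EURUSD-2013-01.csv")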

Some data.table transformations are done (inside a function) so that the timestamps can be read easily and a mid column is produced. The data.table is then saved as an RData file on the hard drive, all references to the data.table object are removed from the workspace, and gc() is run after the removal. However, when looking at the R session in Activity Monitor (on a Mac), it still appears to be using almost 1GB of RAM, and things feel a bit laggy. I was intending to load several years' worth of the csv files at the same time, convert them to usable data.tables, combine them and then create a single xts object, which seems infeasible if just one month uses 1GB of RAM.
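
To make that concrete, the workflow is roughly the sketch below (the column names and timestamp format are just my shorthand for the TrueFX layout of pair, timestamp, bid and ask):

library(data.table)

toDataTable <- function(file) {
  dt <- fread(file)
  setnames(dt, c("pair", "timestamp", "bid", "ask"))
  # timestamps look like "20130102 00:00:00.500"; the exact format string is illustrative
  dt[, timestamp := as.POSIXct(timestamp, format = "%Y%m%d %H:%M:%OS", tz = "GMT")]
  dt[, mid := (bid + ask) / 2]   # mid price column
  dt
}

dt <- toDataTable("csv/EURUSD-2013-01.csv")
save(dt, file = "EURUSD-2013-01.RData")
rm(dt)
gc()   # Activity Monitor still shows the R session using close to 1GB after this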

I know I can sequentially download each file, convert it, save it, shut down R and repeat until I have a bunch of RData files that I can just load and bind, but I was hoping there might be a more efficient way to do this, so that after removing all references to a data.table you get back to "normal", or start-up, levels of RAM usage. Are there better ways of clearing memory than gc()? Any suggestions would be greatly appreciated.
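
For reference, the final load-and-bind step I have in mind looks roughly like this (a sketch, assuming each RData file holds a data.table called dt as above):

library(data.table)
library(xts)

# load each saved data.table and stack them into one table
files  <- list.files(pattern = "\\.RData$")
pieces <- lapply(files, function(f) { load(f); dt })
big    <- rbindlist(pieces)

# single xts of the mid price, indexed by the POSIXct timestamps
mid.xts <- xts(big$mid, order.by = big$timestamp)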

h.l.m
  • Do you have the same symptoms if you use `read.csv` instead of `fread`? – GSee Jan 22 '13 at 01:54
  • It seems to be related to http://stackoverflow.com/questions/1467201/forcing-garbage-collection-to-run-in-r-with-the-gc-command – redmode Jan 22 '13 at 10:30
  • Using `x <- read.csv(...)` brings the memory usage up to 1.2GB, and then running `rm(x)`, followed by `gc()`, brings it down only to 894MB: still nowhere near the original ~75MB RAM usage at startup of R. – h.l.m Jan 22 '13 at 14:42
  • @redmode I agree that it is probably related to the question you linked to; however, the suggested solution of running `gc()` many times over didn't seem to help much in bringing the RAM usage down. – h.l.m Jan 22 '13 at 14:44

1 Answer


In my project I had to deal with many large files. I organized the routine around the following principles:

  1. Isolate memory-hungry operations in separate R scripts.
  2. Run each script in a new process that is destroyed after execution, so the system gives the memory back.
  3. Pass parameters to the scripts via a text file.

Consider the toy example below.

Data generation:

setwd("/path/to")
# written as a real csv so that read.csv() in slave.R parses the columns
write.csv(matrix(1:5e7, ncol=10), "temp.csv") # 465.2 Mb file

slave.R - the memory-consuming part

setwd("/path/to")
library(data.table)

# simple processing: a toy transformation (full-table subset plus a new column)
f <- function(dt){
  dt <- dt[1:nrow(dt),]   # full-row subset (makes a copy of the table)
  dt[,new.row:=1]         # adds a constant column by reference with :=
  return (dt)
}

# reads parameters from file
csv <- read.table("io.csv")
infile  <- as.character(csv[1,1])
outfile <- as.character(csv[2,1])

# memory-hungry operations
dt <- as.data.table(read.csv(infile))
dt <- f(dt)
write.table(dt, outfile)
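
If the R-Forge version of data.table with `fread` is installed (see the comments below), the memory-hungry read above could simply be swapped for it; a one-line sketch:

# drop-in alternative to the read.csv() line above, assuming fread is available
dt <- fread(infile)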

master.R - executes slaves in separate processes

setwd("/path/to")

# 3 files processing
for(i in 1:3){
  # sets iteration-specific parameters
  csv <- c("temp.csv", paste("temp", i, ".csv", sep=""))
  write.table(csv, "io.csv")

  # executes slave process
  system("R -f slave.R")
}
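
After master.R finishes (for example via `R -f master.R` from a shell), a quick sanity check that the per-file outputs were written, just as a sketch:

# the three processed files should now exist on disk
file.exists(paste("temp", 1:3, ".csv", sep=""))   # expected: TRUE TRUE TRUE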
redmode
  • +1 for effort but that `read.csv(infile)` isn't using `colClasses` so it may be memory hungry for that reason, and others. Have you tried the new `fread` function in `data.table` v1.8.7 on R-Forge? Users are reporting success with it. – Matt Dowle Jan 25 '13 at 10:50
  • @MatthewDowle: yes, I tried `fread` and found it awesome for my needs. But, AFAIK, current CRAN `data.table` doesn't have it? So I decided to put _safe_ `read.table` and `write.table` here. – redmode Jan 25 '13 at 13:22
  • @MatthewDowle: More importantly, @h.l.m reported the memory issue while he was using `fread`, so that is not the main point of my answer. The basic idea was to run memory-hungry tasks in separate processes in order to ensure that the memory comes back. – redmode Jan 25 '13 at 13:27
  • Ok, thanks, I didn't pick up on that. If `read.csv` reads into a character vector before coercing to `integer` or `double`, for example (as it will when `colClasses` hasn't been supplied), all those character values will be cached in R's global string cache. As far as I know that cache only grows (unaffected by `gc()`). So you really do want to avoid that by setting `colClasses` or using `fread`. Otherwise it gets difficult to track what's going on. If h.l.m. was already using `fread` then I'll have to study the question again. – Matt Dowle Jan 25 '13 at 15:48
  • Thank you @redmode, great solution! It works perfectly! A nice little trick to free up memory, especially in combination with `fread()`, and thanks again to @MatthewDowle for producing it! – h.l.m Jan 25 '13 at 15:49
  • @MatthewDowle as a side point...any idea how to clear or reduce the size of "R's global string cache"? – h.l.m Jan 25 '13 at 15:50
  • @h.l.m Great. Glad redmode's solution works, but would be even nicer to get to the root cause. By chance do you have a lot of unique strings in your data (such as dates) which would be read as character even by fread (currently) which you then have to convert? Would that fit with my long comment above? – Matt Dowle Jan 25 '13 at 15:54
  • @h.l.m As far as I know you can't reduce the size of R's global string cache. It only grows. That would be a good new question! – Matt Dowle Jan 25 '13 at 15:57
  • @MatthewDowle, yes my data does have dates in it, but I didn't bother setting colClasses as I was using fread and assumed it detected the types for me. The reading of the csv data is what kicks up the RAM usage; the data.table is then manipulated a bit to convert to POSIX timestamps, which also pushes up RAM usage, but even when all objects are removed to deal with the next problem, RAM stays quite high, making the next problem very, very slow to run. – h.l.m Jan 25 '13 at 17:40
  • @h.l.m This sounds like a different problem then. Hm. – Matt Dowle Jan 25 '13 at 17:48
  • @h.l.m maybe you ought to make your example reproducible and show us what you're doing in those intermediate steps. – GSee Jan 25 '13 at 17:59