I'm getting an "Error: cannot allocate vector of size ...MB" when using ff/ffdf and the ffdfdply function.
I'm trying to use the ff package (together with ffbase's ffdfdply) to process a large amount of data that has been keyed into groups. The data (an ffdf) looks like this:
x =
id_1 id_2 month year Amount key
   1   13     1 2013   -200  11
   1   13     2 2013    300  54
   2   19     1 2013    300  82
   3   33     2 2013    300  70
.... (10+ million rows)
The unique keys are created using something like:
# ikey() from ffbase returns one integer per unique combination of the columns
x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))
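In case the number of groups matters, this is how I check it (a minimal sketch, assuming ffbase is attached, which I believe provides a unique() method for ff vectors):

library(ffbase)
# count distinct keys; ffdfdply pulls whole groups into RAM batch by batch,
# so many small groups should be manageable, a few huge ones would not be
n_groups = length(unique(x$key))
n_groups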
To summarise by grouping using the key variable, I have this command:
summary = ffdfdply(x = x, split = x$key, FUN = function(df) {
  # each batch of groups arrives as a data.frame; convert for fast grouping
  df = data.table(df)
  df = df[, list(id_1 = id_1[1],
                 withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)),
          by = "key"]
  df
}, trace = TRUE)
This uses data.table's excellent grouping feature (an idea taken from this discussion). In the real code more functions are applied to the Amount variable, but even with this one alone I cannot process the full ffdf table (a smaller subset of the table works fine).
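For what it's worth, the aggregation itself seems fine; on a small in-RAM table (made-up toy values) it behaves as expected:

library(data.table)
toy = data.table(id_1 = c(1, 1, 2), Amount = c(-200, 300, 300), key = c(11, 54, 82))
toy[, list(id_1 = id_1[1], withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)), by = "key"]
# expected: key 11 -> withdraw 0 (only a negative amount), keys 54 and 82 -> 300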
It seems that ffdfdply is using a huge amount of RAM, giving the:
Error: cannot allocate vector of size 64MB
Also, BATCHBYTES does not seem to help. Can anyone with ffdfdply experience recommend another way to go about this, without pre-splitting the ffdf table into chunks?
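For reference, this is roughly how I tried to lower the batch size (a sketch, in case I am using the option or the argument incorrectly):

options(ffbatchbytes = 2^24)  # ~16 MB per batch read from disk
summary = ffdfdply(x = x, split = x$key,
                   FUN = function(df) {
                     df = data.table(df)
                     df[, list(id_1 = id_1[1],
                               withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)),
                        by = "key"]
                   },
                   BATCHBYTES = 2^24,  # explicit per-call byte budget
                   trace = TRUE)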