I'm getting an "Error: cannot allocate vector of size ...MB" when using ff/ffdf and the ffdfdply function.
I'm trying to use the ff package (together with ffbase's ffdfdply) to process a large amount of data that has been keyed into groups. The data (an ffdf) looks like this:
x =
id_1 id_2 month year Amount key
   1   13     1 2013   -200  11
   1   13     2 2013    300  54
   2   19     1 2013    300  82
   3   33     2 2013    300  70
.... (10+ million rows)
The unique keys are created using something like:
# ikey() from ffbase returns one integer per unique combination of the columns
x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))
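In case the number of groups matters, this is how I check it (a minimal sketch, assuming ffbase is attached, which I believe provides a unique() method for ff vectors):

library(ffbase)
# count distinct keys; ffdfdply pulls whole groups into RAM batch by batch,
# so many small groups should be manageable, a few huge ones would not be
n_groups = length(unique(x$key))
n_groups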
To summarise by grouping using the key variable, I have this command:
summary = ffdfdply(x = x, split = x$key, FUN = function(df) {
  # each batch of groups arrives as a data.frame; convert for fast grouping
  df = data.table(df)
  df = df[, list(id_1 = id_1[1],
                 withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)),
          by = "key"]
  df
}, trace = TRUE)
This uses data.table's excellent grouping feature (an idea taken from this discussion). In the real code more functions are applied to the Amount variable, but even with this one alone I cannot process the full ffdf table (a smaller subset of the table works fine).
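For what it's worth, the aggregation itself seems fine; on a small in-RAM table (made-up toy values) it behaves as expected:

library(data.table)
toy = data.table(id_1 = c(1, 1, 2), Amount = c(-200, 300, 300), key = c(11, 54, 82))
toy[, list(id_1 = id_1[1], withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)), by = "key"]
# expected: key 11 -> withdraw 0 (only a negative amount), keys 54 and 82 -> 300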
It seems that ffdfdply is using a huge amount of RAM, giving the:
Error: cannot allocate vector of size 64MB
Also, BATCHBYTES does not seem to help. Can anyone with ffdfdply experience recommend another way to go about this, without pre-splitting the ffdf table into chunks?
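For reference, this is roughly how I tried to lower the batch size (a sketch, in case I am using the option or the argument incorrectly):

options(ffbatchbytes = 2^24)  # ~16 MB per batch read from disk
summary = ffdfdply(x = x, split = x$key,
                   FUN = function(df) {
                     df = data.table(df)
                     df[, list(id_1 = id_1[1],
                               withdraw = sum(Amount * (Amount > 0), na.rm = TRUE)),
                        by = "key"]
                   },
                   BATCHBYTES = 2^24,  # explicit per-call byte budget
                   trace = TRUE)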