3

I have been trying to use tapply, ave, and ddply to compute statistics by groups of a variable (age, sex), but I haven't been able to use any of these R commands successfully.

library("ff")
df <- as.ffdf(data.frame(a=c(1,1,1:3,1:5), b=c(10:1), c=(1:10)))
tapply(df$a, df$b, length)

The error message I get is

Error in as.vmode(value, vmode) : 
  argument "value" is missing, with no default

or

Error in byMean(df$b, df$a) : object 'index' not found
– sgibb
– askabra

1 Answer

There is currently no tapply or ave for ff vectors implemented in package ff. But what you can do is use the functionality in package ffbase. Let's elaborate on a bigger dataset:

require(ffbase)
a <- ffrep.int(ff(1:100000), times=500) ## 50 million records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)

For your simple aggregation, you can use binned_sum, from which you can easily extract the length as follows. Note that binned_sum needs an ff factor object as the bin, which can be obtained by calling as.character.ff as shown.

df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
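As a sketch of where to go from here (assuming binned_sum, as used above, returns a matrix that also has a "sum" column alongside "count"), you can derive per-group means from that output without pulling the full vectors into RAM:

```r
## Per-group mean = per-group sum divided by per-group count
## (column names "sum" and "count" assumed from the binned_sum result above)
group.means <- agg[, "sum"] / agg[, "count"]
head(group.means)
```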

For more complex aggregations you can use ffdfdply from ffbase. What I frequently do is combine it with some data.table statements like this:

require(data.table)
agg <- ffdfdply(df, split=df$groupbyfactor, FUN=function(x){
  x <- as.data.table(x)
  result <- x[, list(b.mean = mean(b), b.median = median(b), b.length = length(b),
                     whatever = b[c == max(c)][1]), by = list(a)]
  result <- as.data.frame(result)
  result
})
class(agg)
aggg <- as.data.frame(agg) ## Puts the data in RAM!

This will pull your data into RAM in chunks, where each chunk contains the groups belonging to a number of split elements. To each chunk you can apply a function, like some data.table statements, which requires the data to be in RAM. The results of all chunks are then combined into a new ffdf, so that you can use it further, or pull it into RAM if your RAM allows that size.

The size of the chunks is controlled by getOption("ffbatchbytes"). So the more RAM you have, the better, as it allows you to get more data into each chunk in RAM.
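For example, you can inspect and raise this option before calling ffdfdply (the byte count below is just an illustration, not a recommended value):

```r
## Inspect the current per-chunk byte limit used by ff/ffbase
getOption("ffbatchbytes")

## Allow larger chunks, e.g. roughly 500 MB per chunk
options(ffbatchbytes = 500 * 1024^2)
```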

  • Thanks! In the statement df$groupbyfactor <- as.character(df$a) I get the error Error: cannot allocate vector of size 190.7 Mb. I thought this was supposed to be stored on disk. – askabra May 10 '13 at 12:08
  • All data is stored on disk. For all vmodes except factors, the virtual part is only a few Kb in size. For factors, the data is stored on disk but all the factor levels are also kept in RAM. So if you cannot load all your factor levels into RAM, you get that error. When that happens, it is mostly because you also have other objects in RAM that cause the problem. In this example I don't think 100000 factor levels will be the cause of your overflow, but rather other things you have in RAM during your R session. – May 10 '13 at 13:09