There is currently no tapply or ave for ff_vectors implemented in package ff. But you can use the functionality in package ffbase.
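If you don't have it installed yet, ffbase is on CRAN:

install.packages("ffbase")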
Let's illustrate on a bigger dataset:
require(ffbase)
a <- ffrep.int(ff(1:100000), times = 500) ## 50 million records on disk - not in RAM
b <- ffrandom(n=length(a), rfun = runif)
c <- ffseq_len(length(a))
df <- ffdf(a = a, b = b, c = c) ## on disk
dim(df)
For your simple aggregation, you can use binned_sum, from which you can extract the group counts easily, as follows. Note that binned_sum needs an ff factor as its bin argument; you can obtain one with as.character (the as.character.ff method returns an ff factor), as shown below.
df$groupbyfactor <- as.character(df$a)
agg <- binned_sum(x=df$b, bin=df$groupbyfactor, nbins = length(levels(df$groupbyfactor)))
head(agg)
agg[, "count"]
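As an aside, binned_sum gives you per-bin totals as well as counts, so you can derive group means without another pass over the data. A minimal sketch, assuming the returned matrix has a "sum" column next to the "count" column used above (check colnames(agg) in your version of ffbase):

colnames(agg) ## inspect which columns binned_sum actually returned
means <- agg[, "sum"] / agg[, "count"] ## per-group mean, assuming a "sum" column exists
head(means)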
For more complex aggregations, you can use ffdfdply from ffbase. What I frequently do is combine it with data.table statements like this:
require(data.table)
agg <- ffdfdply(df, split = df$groupbyfactor, FUN = function(x) {
  ## x is one chunk of df, holding complete groups, already in RAM as a data.frame
  x <- as.data.table(x)
  result <- x[, list(b.mean = mean(b),
                     b.median = median(b),
                     b.length = length(b),
                     whatever = b[c == max(c)][1]), by = list(a)]
  ## ffdfdply expects FUN to return a data.frame
  as.data.frame(result)
})
class(agg)
agg.df <- as.data.frame(agg) ## Puts the data in RAM!
ffdfdply puts your data in RAM chunk by chunk, where each chunk contains complete groups of the split elements, so that you can apply functions which require their data in RAM (like the data.table statements above). The results of all chunks are then combined into a new ffdf, which you can keep working with on disk, or pull into RAM if it fits.
The chunk sizes are controlled by getOption("ffbatchbytes"), so the more RAM you have the better, as it allows more data in each chunk.
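For example, to allow chunks of roughly 500MB (a hypothetical value, pick one that suits your machine), raise the option before calling ffdfdply:

options(ffbatchbytes = 500 * 1024^2) ## ~500MB per chunk - adjust to your available RAM
getOption("ffbatchbytes")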