Calculate a column of conditional means in the ff/ffbase packages

I'm looking for functionality in the ff/ffbase packages that allows data manipulation similar to what is done below with the data.table package:

library(data.table)
irisdf <- as.data.table(iris)
class(irisdf)
# "data.table" "data.frame"
irisdf[,  NewMean:= mean(Sepal.Length), Species] 

ffbase has a function for conditional means, but it returns a vector whose length is the number of classes in irisdf[,5]:

condMean(x = irisdf[,1], index = irisdf[,5], na.rm = FALSE)

rather than a new vector of length nrow(irisdf).

As @BondedDust suggested, ave() from base R gives the right output:

VectorOfMeans <- ave(irisdf[,1], irisdf[,5], FUN=mean)
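To make the difference between the two output shapes concrete, here is a minimal base R sketch on the plain iris data frame (not an ffdf). It uses tapply() as a stand-in for a per-group summary (an assumption for illustration, not ffbase's condMean itself) and compares it with ave(), which returns one value per row.

```r
# Per-group summary: one mean per Species level (length 3).
per_group <- tapply(iris$Sepal.Length, iris$Species, mean)

# Per-row result: each row receives the mean of its own group (length 150).
per_row <- ave(iris$Sepal.Length, iris$Species, FUN = mean)

length(per_group)
length(per_row)

# Expanding the per-group means by group label reproduces the ave() result.
expanded <- as.vector(per_group[as.character(iris$Species)])
isTRUE(all.equal(per_row, expanded))
```

The last line shows that a per-group vector can always be expanded to a per-row vector by indexing it with the group labels, which is essentially what ave() does internally.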

So the final question is how to add VectorOfMeans to irisdf. I've tried the code below, which works:

irisdf <- as.ffdf(iris)
VectorOfMeans <- as.ffdf(as.ff(ave(irisdf[,1], irisdf[,5], FUN=mean)))
irisdf <- cbind.ffdf2(irisdf, VectorOfMeans)

cbind.ffdf2 comes from an SO answer, but I suspect that question was about something more specific than mine, and that there is an easier (faster) way to do this. I would like to be able to run bigglm.ff on the resulting dataset (irisdf in the example), so please consider my question about merging VectorOfMeans and irisdf in that context (there are issues with physical/virtual modes of storage that I don't understand in detail).
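One possibly simpler route, sketched here under two assumptions: that the installed ff version supports adding a column to an ffdf with `$<-`, and that the value and grouping columns fit in RAM when pulled out with `[]`. The variable name means_ram is only illustrative.

```r
library(ff)

irisdf <- as.ffdf(iris)

# `[]` reads the whole ff column into RAM, so this step does not scale
# to data that is larger than memory.
means_ram <- ave(irisdf$Sepal.Length[], irisdf$Species[], FUN = mean)

# Assumption: `$<-` on an ffdf accepts an ff vector of matching length
# and appends it as a new column.
irisdf$NewMean <- as.ff(means_ram)

dim(irisdf)
```

Because the per-row means are computed in RAM, this is only a convenience for moderately sized data; for data that does not fit in memory, a chunked approach is needed instead.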

Qbik
  • I do not think that the machine and data constraints are clear. And there are assignment methods that accept an "i"-argument. See `?'[<-.ff'` and `?'[<-.ffdf'` for the help page for the assignment functions for such objects. – IRTFM Feb 08 '15 at 18:39
  • @BondedDust yes there is even `condMean`, for conditional mean, but length of output is number of classes, not the number of rows in dataset and classes labels aren't treated as variable. – Qbik Feb 08 '15 at 18:45
  • In base R this would be a problem for the `ave` function. You posted no minimal test case for even the simplest of tests. – IRTFM Feb 08 '15 at 18:48
  • It wasn't clear to me ... not being an ff user .... whether one could or couldn't use ordinary functions with ff-vectors. Hence my suggestion that you post a reproducible example. You posted a data.table example, ... now post an ff example. Your irisdf is NOT an ffdf object when that code runs on my machine. – IRTFM Feb 08 '15 at 18:57
  • @BondedDust you are right about `ave` – Qbik Feb 08 '15 at 19:37
  • @Qbik May be you could check `ffdfdply` from `library(ffbase)` – akrun Feb 08 '15 at 19:45
  • @akrun could you provide answer with `ffdfdply` ? `ave` seems to break on large datasets (>40m obs), I've just tested it – Qbik Feb 08 '15 at 20:02
  • @Qbik I posted a solution. Please check if that works – akrun Feb 08 '15 at 20:12
  • @BondedDust is there a function with the same output as `ave` that accepts a `factor` as the grouping variable? – Qbik Feb 08 '15 at 21:47
  • `ave` should accept a factor index/grouping vector. – IRTFM Feb 08 '15 at 21:51
  • @BondedDust it seems that I have too many levels in the factor (without an error message), or maybe there is a problem when working with large `ffdf` objects and – Qbik Feb 08 '15 at 22:07

1 Answer

Perhaps this helps

library(data.table)
library(ffbase)
x1 <- as.ffdf(iris)
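# Process the ffdf in chunks; `split` keeps all rows of a Species in the same chunk
fd1 <- ffdfdply(x1, split=as.character(x1$Species), FUN=function(x) {
 # x is an ordinary in-RAM data.frame holding one chunk
 x2 <- as.data.table(x)
 res <- x2[, NewMean:= mean(Sepal.Length), Species]
 as.data.frame(res)
}, trace=TRUE)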
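The function applied to each chunk only has to take a data.frame and return a data.frame with the extra column. As a hedged sketch, a base R version of that step could look like this; add_group_mean is a hypothetical helper name, and it is tried here on a plain data.frame that mimics a chunk holding two complete groups.

```r
# Hypothetical per-chunk helper: appends the per-row group mean
# of Sepal.Length within each Species to a plain data.frame.
add_group_mean <- function(chunk) {
  chunk$NewMean <- ave(chunk$Sepal.Length, chunk$Species, FUN = mean)
  chunk
}

# Simulate one chunk that contains two whole groups.
chunk <- droplevels(iris[iris$Species != "versicolor", ])
out <- add_group_mean(chunk)

nrow(out)
unique(out$NewMean)
```

This only gives correct group means when every group sits entirely inside one chunk, which is why the grouping variable is what the data is split on.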
akrun