I tried to aggregate a large dataset with the `ffdfdply` function from the 'ffbase' package in R.
Let's say I have three variables called Date, Item and sales. I want to aggregate sales over Date and Item using the sum function. Could you please guide me to the proper syntax in R?
Here is what I tried:

  grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], split=as.character(data$sales), FUN = function(data)
    summaryBy(Date+Item~sales, data=data, FUN=sum))

I would appreciate a solution.

  • `ffdfdply` isn't in base R. You should mention what package(s) you're using. – Dason Jan 06 '14 at 14:02
  • okay, here i used ff package. In that we have ffdfdply() is there for aggregation. could u help me out. – Chaitanya Krishna T Jan 06 '14 at 14:04
  • You should edit that into the question. While you're at it maybe clean it up a little bit? 'u' in place of 'you' isn't really the best style here. – Dason Jan 06 '14 at 14:08
  • In order to achieve "immediate reply" on SO, it is generally better to provide a [**minimal, reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) than to ask for "immediate reply". BTW, `ffdfdply` is in `ffbase` package, not in `ff`. – Henrik Jan 06 '14 at 14:33

1 Answer

Note that ffdfdply is part of ffbase, not ff. To show an example of how ffdfdply is used, let's generate an ffdf with about 50 million rows.

  require(ffbase)
  data <- expand.ffgrid(Date = ff(seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")), Item = ff(factor(paste("Item", 1:5000))))
  data$sales <- ffrandom(n = nrow(data))
  # split by date -> assuming that all sales of 1 date can fit into RAM
  splitby <- as.character(data$Date, by = 250000)
  grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], 
                      split=splitby, 
                      FUN = function(data){
                        ## This happens in RAM - containing **several** split elements so here we can use data.table which works fine for in RAM computing
                        require(data.table)
                        data <- as.data.table(data)
                        result <- data[, list(sales = sum(sales, na.rm=TRUE)), by = list(Date, Item)]
                        as.data.frame(result)
                      })
  dim(grp_qty)

Note that grp_qty is an ffdf, which resides on disk.
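
Before running this on the full ffdf, it may help to try the per-chunk function on a small in-memory data.frame. The following is only a minimal sketch, assuming data.table is installed; the toy data and the name `sum_sales` are made up for illustration and are not part of ffbase.

```r
library(data.table)

# Hypothetical toy data: one repeated (Date, Item) pair and one missing sale
toy <- data.frame(
  Date  = as.Date(c("2014-01-06", "2014-01-06", "2014-01-06", "2014-01-07")),
  Item  = factor(c("Item 1", "Item 1", "Item 2", "Item 1")),
  sales = c(1.5, 2.5, 4, NA)
)

# Same logic as the FUN passed to ffdfdply above:
# sum sales within each Date/Item combination, ignoring NA values
sum_sales <- function(data) {
  data <- as.data.table(data)
  result <- data[, list(sales = sum(sales, na.rm = TRUE)), by = list(Date, Item)]
  as.data.frame(result)
}

res <- sum_sales(toy)
print(res)
```

Because a chunk handed to FUN can contain several split elements, the function still has to group by Date and Item itself, as it does here, rather than assume it receives a single date.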