I have a problem with the ffdfdply function in R.

a=as.ffdf(data.frame(b=11:20,c=c(4,4,4,4,4,5,5,5,5,5), d=c(1,1,1,0,0,0,1,0,1,1)))

ffdfdply(a, split=a$c, FUN= function(x) {data.frame(cumsum(x$d))}, trace=T)

The output it generates is simply a cumulative sum over the whole column, ignoring the split criterion.

I need output like this:

c   cumsum
4    1
4    2
4    3
4    3
4    3
5    0
5    1
5    1
5    2
5    3

Can we use multiple columns as the "split"? An example would be much appreciated.

Thanks.


@jwijffels, I tested your solution on another data set:

i=as.ffdf(data.frame(a=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2), b=c(1,4,6,2,5,3,1,4,3,2,8,7,1,3,5,4,2,6,3,1,2), c=c(1,1,1,1,1,1,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2), d=c(1,0,1,1,0,1,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0)))

The output I received is incorrect. I need a cumulative sum of column d, grouped by columns a and c.

The step below works and gives the correct result:

idx <- ffdforder(i[c("a","c","b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c <- ikey(ordered_i[c("a", "c")])

But when I try to compute the cumulative sum, I get an incorrect result:

cumsum_i <- ffdfdply(ordered_i, split=as.character(ordered_i$key_a_c), FUN= function(x) {
    ## Data in RAM, on which you can use data.table
    x <- as.data.table(x)
    result <- x[, cumsum_a_c := cumsum(x$d), by = list(key_a_c)]
    as.data.frame(result)
}, trace=T)

Please help. I need to run this set of commands on a large data set.

Ajay
  • possible duplicate of [using ffdfdply to split data and get characteristics of each id in the split](http://stackoverflow.com/questions/10981384/using-ffdfdply-to-split-data-and-get-characteristics-of-each-id-in-the-split) – thelatemail Jul 22 '13 at 10:45
  • Use cumsum_a_c := cumsum(d) instead of cumsum_a_c := cumsum(x$d). That is the correct data.table syntax inside FUN. –  Jul 24 '13 at 12:32

2 Answers

The correct usage is:

require(ffbase)
require(data.table)
a=as.ffdf(data.frame(b=11:20,c=c(4,4,4,4,4,5,5,5,5,5), d=c(1,1,1,0,0,0,1,0,1,1)))
ffdfdply(a, split=as.character(a$c), FUN= function(x) {
  ## Data in RAM, on which you can use data.table
  x <- as.data.table(x)
  result <- x[, cumsum := cumsum(d), by = list(c)]
  as.data.frame(result)
  }, trace=T)

If you want to split by 2 columns, make a new column that combines both and use that as the split. See ?ikey for creating that column.
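As a rough sketch of the combined-key idea without ff: assume one chunk is already in RAM as a plain data.frame. The `chunk` object, its values, and the `key_a_c` / `cumsum_a_c` column names below are illustrative only. Base R's `interaction()` stands in for `ikey()` to build one key from two columns, and `ave()` restarts the cumulative sum within each key.

```r
# Plain data.frame standing in for one in-RAM chunk (hypothetical data)
chunk <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  c = c(1, 1, 2, 1, 2, 2),
  d = c(1, 0, 1, 1, 1, 1)
)

# One key per (a, c) combination; drop = TRUE discards unused combinations
chunk$key_a_c <- as.integer(interaction(chunk$a, chunk$c, drop = TRUE))

# Cumulative sum of d, restarted within each key
chunk$cumsum_a_c <- ave(chunk$d, chunk$key_a_c, FUN = cumsum)

chunk
```

The same grouped computation is what FUN would need to perform on each chunk, since a chunk may hold more than one key.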

  • If the data is big, I suppose x <- as.data.table(x) will cause a memory issue. – Ajay Jul 22 '13 at 12:10
  • No, you will not have memory issues. What you get in 'x' is a subset of your 'a' ffdf where groups of data of one or several split elements are put into RAM. The size of that subset which you put into RAM and on which you will apply FUN is controlled by BATCHBYTES. –  Jul 22 '13 at 12:17
  • You are right, I did not use the right data.table syntax inside the fun. It is cumsum := cumsum(d) instead of cumsum := cumsum(x$d). Updated the answer –  Jul 24 '13 at 12:31

Reading the help is somewhat helpful here. From ?ffdfdply:

this function does not actually split the data. In order to reduce the number of times data is put into RAM for situations with a lot of split levels, the function extracts groups of split elements which can be put into RAM according to BATCHBYTES.

AND....

Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.

So from my reading of that you need to actually have a split-combine-style function that works on groups within the function you call by ffdfdply as well. Like so using ave:

a$c <- with(a, as.integer(c))
ffdfdply(
    a,
    split=a$c,
    function(x) data.frame(c=x$c,cumsum=ave(x$d,x$c,FUN=cumsum)), 
    trace=T
)

Result:

   c cumsum
1  4      1
2  4      2
3  4      3
4  4      3
5  4      3
6  5      0
7  5      1
8  5      1
9  5      2
10 5      3
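
The caveat from the documentation can also be seen in plain R, with no ff involved. This is only a sketch: it assumes the chunk handed to FUN happens to contain both split levels, and the `chunk` data frame is a hand-built stand-in using the question's values. A bare `cumsum` keeps accumulating across the group boundary, while the `ave` version restarts per level.

```r
# Hand-built stand-in for one chunk containing both split levels 4 and 5
chunk <- data.frame(
  c = c(4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  d = c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Ungrouped: the running total carries over from level 4 into level 5
ungrouped <- cumsum(chunk$d)

# Grouped: ave() applies cumsum separately within each value of c
grouped <- ave(chunk$d, chunk$c, FUN = cumsum)

data.frame(c = chunk$c, ungrouped, grouped)
```
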
thelatemail
  • Thanks. Please correct me if I am wrong: BATCHBYTES plays an important role in ffdfdply, so if we are not sure about the data, a fixed BATCHBYTES may give inconsistent results. Can you please give an example that uses multiple columns as the split? – Ajay Jul 22 '13 at 10:51
  • @Ajay - I am not very knowledgeable on `ff`, but it sounds like `ffdfdply` can put several split groups into one chunk, depending on the size of each group and the value of `BATCHBYTES`. Therefore, you have to have **another** grouping function just in case there is more than one group in the chunk. – thelatemail Jul 22 '13 at 10:54
  • Yes, that is why the doc states "Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied." –  Jul 22 '13 at 12:26