
I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns rather than rows. I can accomplish this using base R's reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there a way to combine these two operations into a single step using data.table operations?

Zach

4 Answers


The data.table package implements faster melt/dcast functions (written in C). It also adds features, such as the ability to melt and cast multiple columns. See the new vignette, Efficient reshaping using data.tables, on GitHub.

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

  • There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(); just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor arguments, which default to TRUE and FALSE respectively for compatibility with reshape2. This allows direct control over whether the output variable and value columns are factors.

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt; a short sketch follows at the end of this answer.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.

    • supports using the rowid() function directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.

As a reminder, slide 32 of the Cologne (Dec 2013) presentation addresses: Why not submit a dcast pull request to reshape2?
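
To make this concrete, here is a minimal sketch of the one-step aggregate-and-reshape the question asks for, assuming data.table >= 1.9.6 so that dcast() dispatches on data.tables directly:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))

# Aggregate and reshape in one call: sum v within each (x, y) group,
# then spread the levels of y into columns
dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
#    x   A   B
# 1: 1  72 123
# 2: 2  84 119
# 3: 3 162  96

And a sketch of melting several column groups at once via patterns(); the table dt and its a*/b* columns are made up for illustration:

dt <- data.table(id = 1:3, a1 = 1:3, a2 = 4:6, b1 = 7:9, b2 = 10:12)
# each pattern selects one group of columns; a1/a2 are stacked into
# value column "a" and b1/b2 into "b"
melt(dt, id.vars = "id",
     measure.vars = patterns("^a", "^b"),
     value.name = c("a", "b"))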

Arun
Zach
  • To be fair, it took a while...but Arun posted a solution on another post that I replicated here. What do you think? – Christoph_J Mar 19 '13 at 23:52
  • @Zach, as long as you're editing, why not provide a bit more information on where/how to get it...? – Arun Nov 25 '13 at 19:20
  • Zach, I've expanded it a bit and also provided info from NEWS so that users can get an idea easily. Hope it's alright. – Arun Nov 25 '13 at 21:39

This feature is now implemented in data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6), 
                  y=c("A","B"), 
                  v=sample(1:100,12))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
   x        A        B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650

This gives the same results as DWin's (IRTFM's) approach:

tapply(DT$v,list(DT$x, DT$y), FUN=sum)
         A        B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650

Also, it is fast:

system.time({ 
   out <- DT[,list(SUM=sum(v)),by=list(x,y)]
   out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
##  user  system elapsed 
## 0.64    0.05    0.70 
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user  system elapsed 
## 7.23    0.16    7.39 

UPDATE

For this solution to also work for non-balanced data sets (i.e. where some combinations do not exist), you have to enter those combinations into the data.table first:

library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)

intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]

out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
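
Matt's comment below points out that data.table's cross join CJ() can replace expand.grid here; a sketch of that variant, on the same keyed out:

# CJ() enumerates all (x, y) combinations; joining it against the
# keyed table adds NA rows for the missing ones
out <- out[CJ(unique(x), unique(y))]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]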

Summary

Combining the comments with the above, here's the 1-line solution:

DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(V1), paste(y)), by = x]
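
(To unpack that last step, as requested in the comments: within each x group, as.list(V1) turns the vector of sums into a list, setNames() labels its elements with the matching y values, and returning a named list from j makes each element a separate column, e.g. list(A = 72, B = 123) for x == 1.)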

It's also easy to modify this to have more than just the sum, e.g.:

DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
#   x A.sum B.sum   A.mean B.mean
#1: 1    72   123 36.00000   61.5
#2: 2    84   119 42.00000   59.5
#3: 3   187    96 62.33333   48.0
#4: 4    NA    81       NA   81.0
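
(With the multiple fun.aggregate support described in the answer above, this multi-statistic reshape collapses to one dcast() call; a sketch assuming data.table >= 1.9.6, with fill = NA so absent combinations like x == 4, y == "A" appear as NA:)

# one call produces both statistics; output columns are named by
# combining value.var, the function and the y level (e.g. v_sum_A)
dcast(DT, x ~ y, fun.aggregate = list(sum, mean),
      value.var = "v", fill = NA)
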
Christoph_J
  • Thanks! That's some excellent code. One question: what can I do if each subgroup doesn't necessarily have all the columns? E.g. if there was a value for y of C that was only present when x=4? – Zach Apr 04 '13 at 20:08
  • @Zach Great comment! I recently tried my solution on a large data set and it didn't work, but I didn't figure out why. Thanks to your comment, I know now. So basically, you have to update the data.table first and manually insert all combinations. (I do that with `expand.grid`, but I'm sure there are better solutions out there.) I wondered if this is overkill, but I don't think it is: as soon as you reshape a table into wide format, you are creating all combinations anyway. I think that's a big advantage of the long format: for sparsely populated matrices, it is more efficient. – Christoph_J Apr 04 '13 at 21:43
  • I think that data.table's cross-join (CJ) would work as a replacement for `expand.grid` above. `intDT<-out[,list(x,y)]; setkey(intDT,x,y); intDT<-intDT[CJ(unique(x),unique(y))];` It runs faster on my system, which I would expect for a pure data.table solution. – Matt May 16 '13 at 19:36
  • Can we make this solution a bit more general? The problem here is that you have to change the names after your last line, and this is not going to work if you want to expand more than one column... Say you have SUM, DIFF, AVG and you want to expand those at once? – statquant Jun 05 '13 at 14:30
  • The 1-liner in the summary is great! It's succinct and much faster than the original answer, yet it is buried at the bottom... Could the answer be edited to emphasize the destination (the 1-liner) rather than the journey? Also, an explanation of what `setNames(...` is doing would be helpful. – dnlbrky Sep 23 '13 at 15:18
  • You mention "Zach's answer" in your header? @Zach only has comments, not an answer, as far as I see... You mean Ramnath? – Frank Oct 29 '14 at 20:06
  • @Frank My answer has now floated to the top. See that for the most current way to reshape a data.table. This answer will work if you have an old version of data.table or want to hack something together yourself. – Zach Oct 29 '14 at 20:25

data.table objects inherit from data.frame, so you can just use tapply:

> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
    A   B
1  72 123
2  84 119
3 162  96
IRTFM
  • Will this function be significantly faster than using tapply on a data.frame? – Zach Aug 01 '11 at 17:50
  • From the quick tests that I did, `tapply` isn't any faster on a data.table than on a data.frame. I guess I'll stick with the uglier but faster code. – Zach Aug 01 '11 at 17:56
  • I don't know. I'm guessing not. Fastest would be DT[, sum(v), by=list(x, y)] but it doesn't result in the layout you requested. – IRTFM Aug 01 '11 at 17:58
  • I suppose it's best to think about this as a 2-step operation. Step one is `DT[, sum(v), by=list(x, y)]`, which works great. Step 2 is to reshape the result from long to wide... I'm trying to figure out the best way to do this with a data table – Zach Aug 01 '11 at 18:05
  • I benchmarked the three approaches using `dcast`, `tapply` and `data.table` and found that `tapply` works the fastest by an order of magnitude, which is surprising given that `data.table` is optimized. I suspect it is on account of not defining `keys`, on which the `data.table` optimization works. – Ramnath Aug 01 '11 at 21:25
  • You really need the resulting row names to be one of the columns, if you wanted to merge this back with your original data.table. You could do `res=tapply(DT$v,list(DT$x, DT$y), FUN=sum)` then `res = cbind(v=rownames(res), res)`, then convert to a data.table, then set the key to be `v`, then you'd be able to merge this with other results. Of course the `v` in the `cbind` step is hard coded... and I'm not sure how you'd automate that. Maybe with the dt solution below, but it gets messy and complicated. – geneorama Jun 13 '13 at 15:58
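
A rough sketch of the conversion geneorama describes, with the column name x hard-coded as the comment warns (using the DT from the question):

res <- tapply(DT$v, list(DT$x, DT$y), FUN=sum)
# move the group labels out of the row names into a real column so
# the result can be keyed and merged with other data.tables
res <- data.table(x = rownames(res), res)
setkey(res, x)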

You can use dcast from the reshape2 library. Here is the code:

# DUMMY DATA
library(data.table)
mydf = data.table(
  x = rep(1:3, each = 4),
  y = rep(c('A', 'B'), times = 2),
  v = rpois(12, 30)
)

# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun.aggregate = sum, value.var = "v")

NOTE: The tapply solution would be much faster.

Ramnath