9

Suppose I have a data.table in "melted" form where I have a key, and identifier and a value

library(data.table)
library(reshape2)
DT = data.table(X = c(1:5, 1:4), Y = c(rep("A", 5), rep("B", 4)), Z = rnorm(9))
DT2 = data.table(dcast(DT, X~Y))

How can I perform that sort of self join inside data.table?

> DT
   X Y           Z
1: 1 A -0.19790449
2: 2 A  0.17906116
3: 3 A  0.01821837
4: 4 A  0.17309716
5: 5 A  0.05962474
6: 1 B -0.24629468
7: 2 B  0.92285734
8: 3 B  0.66002573
9: 4 B -1.01403880
> DT2
   X           A          B
1: 1 -0.19790449 -0.2462947
2: 2  0.17906116  0.9228573
3: 3  0.01821837  0.6600257
4: 4  0.17309716 -1.0140388
5: 5  0.05962474         NA

Aside (mostly for Arun): Here is a solution I already use for melt (was written with help from Matthew D, so he should have this code), that I think replicates melt completely, and is pretty efficient. Dcast on the other hand (or should that be dtcast?) is much harder!

melt.data.table = function(data, id.vars, measure.vars,
                           variable.name = "variable",
                           ..., na.rm = FALSE, value.name = "value") {
  if(missing(id.vars)){
    id.vars = setdiff(names(data), measure.vars)
  }
  if(missing(measure.vars)){
    measure.vars = setdiff(names(data), id.vars)
  }

  dtlist = lapply(measure.vars, function(..colname) {
    data[, c(id.vars, ..colname), with = FALSE][, (variable.name) := ..colname]
  })

  dt = rbindlist(dtlist)
  setnames(dt, measure.vars[1], value.name)
  if(na.rm){
    return(na.omit(dt))
  } else {
    return(dt)
  }
}
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Corvus
  • 7,548
  • 9
  • 42
  • 68
  • 2
    thanks for sharing your code. As you say, `melt` is easier. I managed to write it in C and it's available in 1.8.11 as of now, if you want to test it out. Currently, I'm trying to get significant speed-ups for `cast` in C. I'll update once I've managed to commit. – Arun Sep 11 '13 at 09:37

1 Answers1

8

Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.

Now you can just do:

dcast.data.table(DT, X~Y)

In case of dcast alone, at the moment, it has to be written out completely (as it's not a S3 generic yet in reshape2). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you'd do normally.


The general idea is this:

setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]

You can name the columns V1 and V2 as A and B using setnames.

But this may not be efficient on large data or when the cast formula is complex. Or rather I should say, it could be much more efficient. We're in the process of finding such an implementation to integrate melt and cast on to data.table. Until then, you could get around this as above.

I'll update this post once we've made significant progress with melt/cast.

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387
  • 2
    Whoa +1 I did not know about `SJ` and `CJ`. Can't wait for a `dcast.data.table` option too! – Justin Sep 11 '13 at 01:58
  • Thanks @Arun, I've added the melt code I already have to the end of the question in case it interests you (based on your last sentence). In a more general case of your answer, I presume I just replace `1:5` with something like `unique(DT[Y %in% c("A","B"), X])` or is there a shorter way? – Corvus Sep 11 '13 at 07:05
  • 2
    Or `DT[CJ(unique(X), unique(Y))][...]` (to avoid `DT$` repetition) and `DT[CJ(unique(X), unique(Y)),...]` to bring as much into one `[...]` as possible for query optimization to have more to chew on (since everything inside `[...]` is delayed evaluation). Btw @Carone, Arun's melt in C should be significantly better than our melt. He's really got into the C side of `data.table` and fixed some segfault bugs too. When you read [NEWS](https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable) it's not always me behind it now. – Matt Dowle Sep 11 '13 at 11:11