
Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the data.table to a matrix)?

library(data.table)
set.seed(1)             # reproducibility
dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
setkey(dt, group)

#    id group      value
# 1:  1     a -0.6264538
# 2:  2     a  0.1836433
# 3:  3     a -0.8356286
# 4:  4     a  1.5952808
# 5:  1     b  0.3295078
# 6:  2     b -0.8204684
# 7:  3     b  0.4874291
# 8:  4     b  0.7383247

Something that works, but requires the group names as input:

cor(dt["a"]$value, dt["b"]$value)
# [1] 0.1556371

I'm looking more for something like:

dt[, cor(value, value), by="group"]

But that does not give me the correlation(s) I'm after.
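
It just returns each group's correlation of value with itself, which is trivially 1 (same dt as above):

dt[, cor(value, value), by="group"]
#    group V1
# 1:     a  1
# 2:     b  1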

Here's the same problem for a matrix with the correct results.

set.seed(1)             # reproducibility
m <- matrix(rnorm(8), ncol=2)
dimnames(m) <- list(id=1:4, group=letters[1:2])

#        group
# id           a          b
#   1 -0.6264538  0.3295078
#   2  0.1836433 -0.8204684
#   3 -0.8356286  0.4874291
#   4  1.5952808  0.7383247

cor(m)                  # correlations between groups

#           a         b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000

Any comments or help greatly appreciated.

Bram Visser

3 Answers


I've since found an even simpler alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"] approach. What you need is to first do a Cartesian self-join on the key (id here), and then group by both group columns, i.e.

dt[dt, allow.cartesian=T][, cor(value, value.1), by=list(group, group.1)]

This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast the result into matrix form, or leave it as it is to plot as a heatmap in ggplot2 (see the sketch after the full example).

Full Example

setkey(dt, id)
c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
c

   group group.1       Cor
1:     a       a 1.0000000
2:     b       a 0.1556371
3:     a       b 0.1556371
4:     b       b 1.0000000

dcast(c, group~group.1, value.var = "Cor")

  group         a         b
1     a 1.0000000 0.1556371
2     b 0.1556371 1.0000000
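
For the heatmap option mentioned above, a minimal sketch (assuming ggplot2 is installed), using the molten correlation table c from the full example:

library(ggplot2)

# one tile per (group, group.1) pair, coloured by the correlation value
ggplot(c, aes(x = group, y = group.1, fill = Cor)) +
  geom_tile()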
Corvus

There is no simple way to do this with data.table. The first way you've provided:

cor(dt["a"]$value, dt["b"]$value)

is probably the simplest.

An alternative is to reshape your data.table from "long" format, to "wide" format:

> dtw <- reshape(dt, timevar="group", idvar="id", direction="wide")
> dtw
   id    value.a    value.b
1:  1 -0.6264538  0.3295078
2:  2  0.1836433 -0.8204684
3:  3 -0.8356286  0.4874291
4:  4  1.5952808  0.7383247
> cor(dtw[,list(value.a, value.b)])
          value.a   value.b
value.a 1.0000000 0.1556371
value.b 0.1556371 1.0000000

Update: If you're using data.table version >= 1.9.0, then you can use dcast.data.table instead, which will be much faster. Check this post for more info.

dcast.data.table(dt, id ~ group)
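
A rough sketch of the full pipeline (the dtw2 name is my own; value.var spelled out explicitly, assuming data.table >= 1.9.0):

dtw2 <- dcast.data.table(dt, id ~ group, value.var = "value")
cor(dtw2[, list(a, b)])
#           a         b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000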
Scott Ritchie
  • Are you aware of `dcast.data.table`? – Roland Mar 15 '14 at 10:09
  • I think I've come across it before, but now that I dig deeper it turns out I'm stuck on an older version of `data.table` – Scott Ritchie Mar 15 '14 at 10:14
  • I've updated the answer to reflect the latest version of `data.table` – Scott Ritchie Mar 15 '14 at 23:19
  • Prior to version `1.8.11`, `reshape` didn't work with `data.table`s, because `dcast.data.table` and `melt.data.table` weren't implemented, so you had to cast to a `data.frame` and back. – Scott Ritchie Mar 15 '14 at 23:41
  • @ScottRitchie, not true. I think you're confusing the base function with Hadley's package. `reshape` is a base function that'll work just fine on *all* versions of `data.table`. `reshape` and `reshape2` are also packages with melt/cast functions. `data.table >= 1.9.0` versions implement `melt` and `dcast` methods and import `reshape2`. – Arun Mar 15 '14 at 23:53
  • @Arun thanks for clearing that up. I've verified that `base::reshape` works on `data.table` without importing the `reshape2` library. However, for a slightly larger example with 1M rows, `dcast.data.table` is ten times faster than `reshape`, so I'll stick with `dcast` for now. – Bram Visser Mar 16 '14 at 01:24
  • @BramVisser, yes, right and right. `dcast.data.table` takes advantage of all data.table internals, + other essentials written in C. Let us know if you come across any issues [here](https://r-forge.r-project.org/tracker/?atid=975&group_id=240&func=browse). Remember you'll have to spell it out completely as `dcast.data.table` and not just `dcast` for now; it'll be fixed later. If you just use `dcast` it'll use `reshape2`'s `dcast`, which is much slower than `dcast.data.table`. – Arun Mar 16 '14 at 01:56

I don't know a way to get it in matrix form straight away, but I find this solution useful:

dt[, {x = value; dt[, cor(x, value), by = group]}, by=group]

   group group        V1
1:     a     a 1.0000000
2:     a     b 0.1556371
3:     b     a 0.1556371
4:     b     b 1.0000000

This is convenient because you started with a molten dataset and you end up with a molten representation of the correlations.

Using this form you can also choose to calculate only certain pairs; in particular, it is a waste of time to calculate both off-diagonal halves. For example:

dt[, {x = value; g = group; dt[group <= g, list(cor(x, value)), by = group]}, by=group]

   group group        V1
1:     a     a 1.0000000
2:     b     a 0.1556371
3:     b     b 1.0000000

Alternatively, this form works just as well for the cross-correlation between two sets (i.e. the off-diagonal block):

library(data.table)
set.seed(1)             # reproducibility
dt1 <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
dt2 <- data.table(id=1:4, group=rep(letters[3:4], c(4,4)), value=rnorm(8))
setkey(dt1, group)
setkey(dt2, group)

dt1[, {x = value; g = group; dt2[, list(cor(x, value)), by = group]}, by=group]

   group group          V1
1:     a     c -0.39499814
2:     a     d  0.74234458
3:     b     c  0.96088312
4:     b     d  0.08016723

Obviously, if you ultimately want these in matrix form, you can use dcast or dcast.data.table. However, notice that in the above examples you end up with two columns with the same name; to fix this, it is worth renaming them in the j expression. For the original problem:

dcast.data.table(
  dt[, {x = value; g1 = group; dt[, list(g1, g2 = group, c = cor(x, value)), by = group]}, by = group],
  g1 ~ g2, value.var = "c")

   g1         a         b
1:  a 1.0000000 0.1556371
2:  b 0.1556371 1.0000000
Corvus