2

I have a data.table like this one:

set.seed(12345)
mydt <- data.table(gr1 = sample(letters[1:2], size = 100, replace = TRUE),
        gr2 = sample(letters[3:4], size = 100, replace = TRUE),
        a = rnorm(100), b = rnorm(100), weight = rnorm(100, 5, 1))

The gr1 and gr2 specify the group membership of each case. I would like to obtain the correlation matrix from the cov.wt function by the group membership specified by gr1 and gr2, using the weight column. The cov.wt returns the correlation matrix if cor = TRUE. I can split mydt by gr1 and gr2 and then use lapply to do the computations and extract each of the correlation matrices:

mydt <- split(x = mydt, by = c("gr1", "gr2"), drop = TRUE)

lapply(X = mydt, FUN = function(i) {
  cov.wt(x = as.matrix(i[ , c("a", "b")]), wt = i[ , weight], cor = TRUE)[["cor"]]
})

I get exactly what I want:

$b.c
                    a                   b
a 0.99999999999999978 0.26861150206539375
b 0.26861150206539375 0.99999999999999978

$a.c
                     a                    b
a  0.99999999999999978 -0.13281683546112405
b -0.13281683546112405  1.00000000000000000

$b.d
                     a                    b
a  1.00000000000000000 -0.13064774898011455
b -0.13064774898011455  1.00000000000000000

$a.d
                     a                    b
a  0.99999999999999978 -0.61122086293705469
b -0.61122086293705458  0.99999999999999978

However, with large datasets this approach is rather slow. I would like to use the data.table way to achieve this, just like the post from Dan Y under this question. I am struggling, thou, because of the extra parameters and the extraction of the correlation matrix from the list returned by the cov.wt function. I tried the following (plus many variations):

mydt[ , .(cov.wt(as.matrix(a, b), wt = weight, cor = TRUE)["cor"]), by = c("gr1", "gr2")]

What I get at the end is just the first value of the diagonal of each matrix.

What do I do wrong?

panman
  • 1,179
  • 1
  • 13
  • 33

1 Answers1

2

Here, the as.matrix is wrong because the 'x' is a single element and not multiple (based on the ?as.matrix). One option is to convert to matrix by cbinding the vectors 'a', 'b', then wrap the output in a list (with .())

library(data.table)
out <- mydt[ , .(.(cov.wt(cbind(a,b), wt = weight, cor = TRUE)["cor"])), 
    by = c("gr1", "gr2")]
out$V1
#[[1]]
#[[1]]$cor
#          a         b
#a 1.0000000 0.2686115
#b 0.2686115 1.0000000


#[[2]]
#[[2]]$cor
#           a          b
#a  1.0000000 -0.1328168
#b -0.1328168  1.0000000


#[[3]]
#[[3]]$cor
#           a          b
#a  1.0000000 -0.1306477
#b -0.1306477  1.0000000


#[[4]]
#[[4]]$cor
#           a          b
#a  1.0000000 -0.6112209
#b -0.6112209  1.0000000

NOTE: There is a difference in values due to the seed

akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Oh, God, you are so fast! Thank you very much, this solves the problem. Thank you for the explanation too. – panman Apr 22 '20 at 18:29
  • Sorry, I have an additional question. I can pass wt as a character string (i.e. `get("weight")`, but how to pass a and b as a character string (i.e. `my.cols <- c("a", "b"))` to `cbind`, something like `cbind(my.cols)`. I tried different things, but it gives me an error "'x' must contain finite values only". – panman Apr 23 '20 at 12:35
  • I got it: my.cols <- c("a", "b") my.keys <- c("gr1", "gr2") out <- mydt[ , .(.(cov.wt(as.matrix(.SD), wt = get("weight"), cor = TRUE)["cor"])), by = my.keys, .SDcols = my.cols] out[["V1"]] – panman Apr 23 '20 at 12:47
  • 1
    Thank you! Works even faster than the solution I found and is more simple. – panman Apr 24 '20 at 09:32