0

Consider data.frame consisting of two ID variables and a third variable, relate, indicating the relationship (e.g. a covariance) between each pair of ID variables:

options(stringsAsFactors = FALSE, digits = 2)
set.seed(1)
(example <- data.frame(v1 = LETTERS[rep(c(2,3,1,4,7),
                                       times = c(3,2,2,1,2))],
                      v2 = LETTERS[c(2:4, 4:5, 4:5, 5,8,9)],
                      relate = rnorm(10)))

   v1 v2 relate
1   B  B  -0.63
2   B  C   0.18
3   B  D  -0.84
4   C  D   1.60
5   C  E   0.33
6   A  D  -0.82
7   A  E   0.49
8   D  E   0.74
9   G  H   0.58
10  G  I  -0.31

These data are unbalanced in that not every pairwise relationship is accounted for and not every in v1 is in v2 and vice versa.

The goal is to expand these data into a symmetric matrix where all the unique values in both v1 and v2 comprise the row and column names and the entries are the corresponding values of relate (if it exists, else NA). In other words, to populate the below matrix with values of relate:

vars <- with(example, unique(c(v1, v2)))
matrix(nrow = length(vars),
       ncol = length(vars),
       dimnames = list(vars, vars))
   B  C  A  D  G  E  H  I
B NA NA NA NA NA NA NA NA
C NA NA NA NA NA NA NA NA
A NA NA NA NA NA NA NA NA
D NA NA NA NA NA NA NA NA
G NA NA NA NA NA NA NA NA
E NA NA NA NA NA NA NA NA
H NA NA NA NA NA NA NA NA
I NA NA NA NA NA NA NA NA

However, the ready made solutions I've tried so far (data.table::dcast, reshape2:dcast) produce the following unbalanced results:

 data.table::dcast(example, v1~v2, value.var = 'relate')
  v1     B    C     D    E    H     I
1  A    NA   NA -0.82 0.49   NA    NA
2  B -0.63 0.18 -0.84   NA   NA    NA
3  C    NA   NA  1.60 0.33   NA    NA
4  D    NA   NA    NA 0.74   NA    NA
5  G    NA   NA    NA   NA 0.58 -0.31

Is there an efficient way to instead obtain a balanced, symmetric matrix of values from a set of ID-pairs and relating values?

One solution that occurs to me is to manually populate the above symmetric matrix of NA values using mapply and defining a subsetting function, but that seems unnecessarily cumbersome.

Richard Border
  • 3,209
  • 16
  • 30

1 Answers1

2

We can convert the 'v1/v2' columns to factor with levels specified as the unique elements in the unlisted 'v1/v2' columns in 'example', and then do dcast with drop=FALSE

library(data.table)
un1 <- unique(unlist(example[1:2]))
setDT(example)[, paste0("v", 1:2) := lapply(.SD, factor, levels = un1), 
            .SDcols = v1:v2]
dcast(example, v1~v2, value.var = "relate", drop = FALSE)
#   v1     B    C  A     D  G    E    H     I
#1:  B -0.63 0.18 NA -0.84 NA   NA   NA    NA
#2:  C    NA   NA NA  1.60 NA 0.33   NA    NA
#3:  A    NA   NA NA -0.82 NA 0.49   NA    NA
#4:  D    NA   NA NA    NA NA 0.74   NA    NA
#5:  G    NA   NA NA    NA NA   NA 0.58 -0.31
#6:  E    NA   NA NA    NA NA   NA   NA    NA
#7:  H    NA   NA NA    NA NA   NA   NA    NA
#8:  I    NA   NA NA    NA NA   NA   NA    NA
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Exactly what I was looking for--had tried specifying `drop = FALSE` but it hadn't occurred to specify additional factor levels. Cheers. – Richard Border Aug 13 '16 at 15:44