Consider data.frame
consisting of two ID variables and a third variable, relate
, indicating the relationship (e.g. a covariance) between each pair of ID variables:
options(stringsAsFactors = FALSE, digits = 2)
set.seed(1)
(example <- data.frame(v1 = LETTERS[rep(c(2,3,1,4,7),
times = c(3,2,2,1,2))],
v2 = LETTERS[c(2:4, 4:5, 4:5, 5,8,9)],
relate = rnorm(10)))
v1 v2 relate
1 B B -0.63
2 B C 0.18
3 B D -0.84
4 C D 1.60
5 C E 0.33
6 A D -0.82
7 A E 0.49
8 D E 0.74
9 G H 0.58
10 G I -0.31
These data are unbalanced in that not every pairwise relationship is accounted for and not every in v1
is in v2
and vice versa.
The goal is to expand these data into a symmetric matrix where all the unique values in both v1
and v2
comprise the row and column names and the entries are the corresponding values of relate
(if it exists, else NA
). In other words, to populate the below matrix with values of relate
:
vars <- with(example, unique(c(v1, v2)))
matrix(nrow = length(vars),
ncol = length(vars),
dimnames = list(vars, vars))
B C A D G E H I
B NA NA NA NA NA NA NA NA
C NA NA NA NA NA NA NA NA
A NA NA NA NA NA NA NA NA
D NA NA NA NA NA NA NA NA
G NA NA NA NA NA NA NA NA
E NA NA NA NA NA NA NA NA
H NA NA NA NA NA NA NA NA
I NA NA NA NA NA NA NA NA
However, the ready made solutions I've tried so far (data.table::dcast
, reshape2:dcast
) produce the following unbalanced results:
data.table::dcast(example, v1~v2, value.var = 'relate')
v1 B C D E H I
1 A NA NA -0.82 0.49 NA NA
2 B -0.63 0.18 -0.84 NA NA NA
3 C NA NA 1.60 0.33 NA NA
4 D NA NA NA 0.74 NA NA
5 G NA NA NA NA 0.58 -0.31
Is there an efficient way to instead obtain a balanced, symmetric matrix of values from a set of ID-pairs and relating values?
One solution that occurs to me is to manually populate the above symmetric matrix of NA
values using mapply
and defining a subsetting function, but that seems unnecessarily cumbersome.