Distance matrix to pairwise distance list in R

Question

Is there an R package that produces a pairwise distance list when the input is a distance matrix? For example, if my input is a data.frame like this:

        A1      B1      C1      D1
 A1     0      0.85    0.45    0.96 
 B1            0       0.85    0.56
 C1                    0       0.45
 D1                            0

I want the output as:

A1  B1  0.85
A1  C1  0.45
A1  D1  0.96
B1  C1  0.85
B1  D1  0.56
C1  D1  0.45

I found a question about doing the opposite using the 'reshape' package, but could not adapt it to get what I wanted.

Please post the output of `dput(your-distance-object)` so we are not guessing whether you are actually dealing with a `data.frame`, a `matrix`, a `table`, an actual distance matrix, or something else entirely. This would definitely influence the applicability of the answers presented so far. I ask this because your title says "distance matrix" (which is generally created using the `dist` function), but your question description says you're dealing with a `data.frame`. These are quite different. — A5C1D2H2I1M1N2O1R2T1, Jan 12 '15 at 04:38
I'm also suspicious about this... distance matrices generated with `dist` print the lower triangle by default, not the upper triangle. And are your blank cells `NA`, or simply hidden (as with the `print` method for `dist` objects)? — jbaums, Jan 12 '15 at 04:55

score 14 · Answer 1 · answered Jan 12 '15 at 04:54

A couple of other options:

Generate some data

D <- dist(cbind(runif(4), runif(4)), diag=TRUE, upper=TRUE) # generate dummy data
m <- as.matrix(D) # coerce dist object to a matrix
dimnames(m) <- list(LETTERS[1:4], LETTERS[1:4]) # label rows and columns A-D

Assuming you just want the distances for pairs defined by the upper triangle of the distance matrix, you can do:

xy <- t(combn(colnames(m), 2))
data.frame(xy, dist=m[xy])

#  X1 X2      dist
# 1 A  B 0.3157942
# 2 A  C 0.5022090
# 3 A  D 0.3139995
# 4 B  C 0.1865181
# 5 B  D 0.6297772
# 6 C  D 0.8162084
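As a rough illustration of why this works: indexing a matrix with a two-column character matrix looks up each row of that index as a (row name, column name) pair. The sketch below applies the same two lines to the question's values, assuming they are held in a numeric matrix with identical row and column names and with NA in the blank cells. That layout is an assumption, since the question only shows printed output, and `labs` and `pair_list` are names made up for this example.

```r
# Assumed reconstruction of the question's input: a named numeric matrix
# with NA standing in for the blank lower-triangle cells.
labs <- c("A1", "B1", "C1", "D1")
m <- matrix(NA_real_, 4, 4, dimnames = list(labs, labs))
diag(m) <- 0
# upper.tri() fills column by column: A1-B1, A1-C1, B1-C1, A1-D1, B1-D1, C1-D1
m[upper.tri(m)] <- c(0.85, 0.45, 0.85, 0.96, 0.56, 0.45)

# One row per unordered pair of labels, e.g. c("A1", "B1")
xy <- t(combn(colnames(m), 2))

# m[xy] looks up each pair by (row name, column name)
pair_list <- data.frame(xy, dist = m[xy])
print(pair_list)
```

Because the lookup is by name, both `rownames(m)` and `colnames(m)` need to contain every label in `xy`. A matrix or data.frame whose row names do not match its column names would be expected to fail at `m[xy]`.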

Alternatively, if you want distances for all pairs (in both directions):

data.frame(col=colnames(m)[col(m)], row=rownames(m)[row(m)], dist=c(m))

#    col row      dist
# 1    A   A 0.0000000
# 2    A   B 0.3157942
# 3    A   C 0.5022090
# 4    A   D 0.3139995
# 5    B   A 0.3157942
# 6    B   B 0.0000000
# 7    B   C 0.1865181
# 8    B   D 0.6297772
# 9    C   A 0.5022090
# 10   C   B 0.1865181
# 11   C   C 0.0000000
# 12   C   D 0.8162084
# 13   D   A 0.3139995
# 14   D   B 0.6297772
# 15   D   C 0.8162084
# 16   D   D 0.0000000

or the following, which excludes any NA distances, but doesn't keep the column/row names (though this would be easy to rectify since we have the column/row indices):

data.frame(which(!is.na(m), arr.ind=TRUE, useNames=FALSE), dist=m[!is.na(m)])

I get the following error msg. Any idea why ? Error in m[xy] : subscript out of bounds — Anurag Mishra, Jan 12 '15 at 10:49
@AnuragMishra When you run my code? Or when you apply it to your data? — jbaums, Jan 12 '15 at 11:18
@AnuragMishra Please edit your question and add the output of `dput(d)`, where `d` is your dataframe. If `d` is too large to include in this way, then provide a small subset of it for us to work with. — jbaums, Jan 12 '15 at 12:15
I am using two columns from a data frame as the X and Y coordinates to find distances. dput() gives me the following Size = 121L, Diag = TRUE, Upper = TRUE, method = "euclidean", call = dist(x = cbind(x$da1, x$da2), diag = TRUE, upper = TRUE), class = "dist") x$da1 and x$da2 are my two columns from the data frame 'x' Is this what you wanted ? — Anurag Mishra, Jan 14 '15 at 06:58
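The last comment suggests the object is an unlabelled `dist` object rather than a named matrix. In that situation, one possible route is to skip the matrix step and read the pairs straight off the `dist` object using its `Size` and `Labels` attributes. The following is a minimal sketch with dummy data, not the commenter's actual data; `labs`, `idx`, and `pair_list` are names chosen for this example.

```r
set.seed(1)
D <- dist(cbind(runif(4), runif(4)))  # dummy data; no row labels

n <- attr(D, "Size")
labs <- attr(D, "Labels")
if (is.null(labs)) labs <- seq_len(n)  # fall back to 1..n when unlabelled

# combn() enumerates pairs in the order (1,2), (1,3), ..., (n-1,n),
# which matches the order in which a dist object stores its values.
idx <- t(combn(n, 2))

pair_list <- data.frame(item1 = labs[idx[, 1]],
                        item2 = labs[idx[, 2]],
                        dist  = as.vector(D))
print(pair_list)
```

This relies on the storage order of `dist` objects (the lower triangle, column by column), so it only makes sense for a genuine `dist` object and not for an arbitrary numeric vector.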

J.R. · Answer 2 · 2015-01-11T22:24:22.917

If you have a data.frame you could do something like:

df <- structure(list(A1 = c(0, 0, 0, 0), B1 = c(0.85, 0, 0, 0), C1 = c(0.45, 
0.85, 0, 0), D1 = c(0.96, 0.56, 0.45, 0)), .Names = c("A1", "B1", 
"C1", "D1"), row.names = c(NA, -4L), class = "data.frame")

data.frame( t(combn(names(df),2)), dist=t(df)[lower.tri(df)] )
  X1 X2 dist
1 A1 B1 0.85
2 A1 C1 0.45
3 A1 D1 0.96
4 B1 C1 0.85
5 B1 D1 0.56
6 C1 D1 0.45

Another approach, if you have a matrix with row and column names, is to use reshape2 directly:

mat <- structure(c(0, 0, 0, 0, 0.85, 0, 0, 0, 0.45, 0.85, 0, 0, 0.96, 
0.56, 0.45, 0), .Dim = c(4L, 4L), .Dimnames = list(c("A1", "B1", 
"C1", "D1"), c("A1", "B1", "C1", "D1")))

library(reshape2)
subset(melt(mat), value!=0)

   Var1 Var2 value
5    A1   B1  0.85
9    A1   C1  0.45
10   B1   C1  0.85
13   A1   D1  0.96
14   B1   D1  0.56
15   C1   D1  0.45
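One caveat with filtering on `value != 0` is that it would also drop any pair whose distance is genuinely zero. A base R sketch that selects by position instead, using `upper.tri()`, is shown below. The matrix is a hypothetical variant of `mat` in which the B1-C1 distance has been set to 0 to show that the pair is kept; `mat0`, `keep`, and `upper_pairs` are names made up for this example.

```r
labs <- c("A1", "B1", "C1", "D1")
# Hypothetical variant of the matrix above with a true zero distance for B1-C1
mat0 <- matrix(c(0, 0, 0, 0,
                 0.85, 0, 0, 0,
                 0.45, 0, 0, 0,
                 0.96, 0.56, 0.45, 0),
               nrow = 4, dimnames = list(labs, labs))

keep <- upper.tri(mat0)  # TRUE strictly above the diagonal

# row() and col() give the index of every cell; subset them with the same mask
upper_pairs <- data.frame(Var1  = rownames(mat0)[row(mat0)[keep]],
                          Var2  = colnames(mat0)[col(mat0)[keep]],
                          value = mat0[keep])
print(upper_pairs)
```

The rows come out in the same column-by-column order as the `melt` output, and no extra package is needed.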

Colonel Beauvel · Answer 3 · 2015-01-11T22:01:28.293

I suppose you have a contingency table or a matrix defined as follows:

mat = matrix(c(0, 0.85, 0.45, 0.96, NA, 0, 0.85, 0.56, NA, NA, 0, 0.45, NA,NA,NA,0), ncol=4)
cont = as.table(t(mat))

#     A    B    C    D
#A 0.00 0.85 0.45 0.96
#B      0.00 0.85 0.56
#C           0.00 0.45
#D                0.00

Then you simply need to convert it to a data.frame and remove the NA and 0 entries:

df = as.data.frame(cont)
df = df[complete.cases(df),]
df[df[,3]!=0,]

#   Var1 Var2 Freq
#5     A    B 0.85
#9     A    C 0.45
#10    B    C 0.85
#13    A    D 0.96
#14    B    D 0.56
#15    C    D 0.45

jmuhlenkamp · Answer 4 · 2022-05-10T22:11:56.697

Tidymodels Answer

This is exactly the type of thing that the broom package excels at. It is a tidymodels package.

Borrowing the dummy data from jbaums's answer:

D <- dist(cbind(runif(4), runif(4))) # generate dummy data

This is a single function call.

library(broom)
tidy(D)

Which returns

# A tibble: 6 x 3
  item1 item2 distance
  <fct> <fct>    <dbl>
1 1     2        0.702
2 1     3        0.270
3 1     4        0.292
4 2     3        0.960
5 2     4        0.660
6 3     4        0.510

Note that it also works for other values of diag and upper:

tidy(dist(cbind(runif(4), runif(4)), diag=TRUE, upper=TRUE))
tidy(dist(cbind(runif(4), runif(4)), diag=FALSE, upper=TRUE))
tidy(dist(cbind(runif(4), runif(4)), diag=TRUE, upper=FALSE))

Aeck · Answer 5 · 2015-01-12T02:46:44.677

Here is an example using the spaa package:

exampleInput <- structure(list(A1 = c(0, 0, 0, 0), B1 = c(0.85, 0, 0, 0), 
C1 = c(0.45, 0.85, 0, 0), D1 = c(0.96, 0.56, 0.45, 0)), 
.Names = c("A1", "B1", "C1", "D1"), row.names = c(NA, -4L), class = "data.frame")

library(spaa)
pairlist <- dist2list(as.dist(t(exampleInput)))
pairlist[as.numeric(pairlist$col) > as.numeric(pairlist$row),]

Output:

   col row value
2   B1  A1  0.85
3   C1  A1  0.45
4   D1  A1  0.96
7   C1  B1  0.85
8   D1  B1  0.56
12  D1  C1  0.45
