84

scipy.spatial.distance.pdist returns a condensed distance matrix. From the documentation:

Returns a condensed distance matrix Y. For each and (where ), the metric dist(u=X[i], v=X[j]) is computed and stored in entry ij.

I thought ij meant i*j. But I think I might be wrong. Consider

X = array([[1,2], [1,2], [3,4]])
dist_matrix = pdist(X)

then the documentation says that dist(X[0], X[2]) should be dist_matrix[0*2]. However, dist_matrix[0*2] is 0 -- not 2.8 as it should be.

What's the formula I should use to access the similarity of a two vectors, given i and j?

feedMe
  • 3,431
  • 2
  • 36
  • 61
Rafael Almeida
  • 2,377
  • 3
  • 22
  • 32

7 Answers7

107

You can look at it this way: Suppose x is m by n. The possible pairs of m rows, chosen two at a time, is itertools.combinations(range(m), 2), e.g, for m=3:

>>> import itertools
>>> list(combinations(range(3),2))
[(0, 1), (0, 2), (1, 2)]

So if d = pdist(x), the kth tuple in combinations(range(m), 2)) gives the indices of the rows of x associated with d[k].

Example:

>>> x = array([[0,10],[10,10],[20,20]])
>>> pdist(x)
array([ 10.        ,  22.36067977,  14.14213562])

The first element is dist(x[0], x[1]), the second is dist(x[0], x[2]) and the third is dist(x[1], x[2]).

Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.

E.g.

>>> squareform(pdist(x)) 
array([[  0.   ,  10.   ,  22.361],
       [ 10.   ,   0.   ,  14.142],
       [ 22.361,  14.142,   0.   ]])

>>> y = array([[0,10],[10,10],[20,20],[10,0]])
>>> squareform(pdist(y)) 
array([[  0.   ,  10.   ,  22.361,  14.142],
       [ 10.   ,   0.   ,  14.142,  10.   ],
       [ 22.361,  14.142,   0.   ,  22.361],
       [ 14.142,  10.   ,  22.361,   0.   ]])
>>> pdist(y)
array([ 10.   ,  22.361,  14.142,  14.142,  10.   ,  22.361])
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • 1
    I see, interesting. The squareform is easier to use, it seems. sq_form[i,j] will get me exactly the distance between y[i] and y[j]. However, I think the condensed form is better memory-wise. Perhaps I should read a little bit more on what squareform does. But there isn't a simple formula which converts i,j into a dist position, then? – Rafael Almeida Oct 26 '12 at 02:43
  • 2
    Is this actually documented behavior? It makes sense, sure, but nothing in the API makes it seem like it should collate with `combinations(range(m), 2))`, which corresponds to the lower triangle of the distance matrix. Why not the upper? – VF1 Dec 30 '16 at 04:42
  • Why do you say it corresponds to the lower triangle? For example, `list(combinations(range(4), 2))` gives `[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]`. Each tuple in that list has the form (row_index, column_index), so it corresponds to the upper triangle. – Warren Weckesser Dec 30 '16 at 05:18
52

Condensed distance matrix to full distance matrix

A condensed distance matrix as returned by pdist can be converted to a full distance matrix by using scipy.spatial.distance.squareform:

>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> points = np.array([[0,1],[1,1],[3,5], [15, 5]])
>>> dist_condensed = pdist(points)
>>> dist_condensed
array([  1.        ,   5.        ,  15.5241747 ,   4.47213595,
        14.56021978,  12.        ])

Use squareform to convert to full matrix:

>>> dist = squareform(dist_condensed)
array([[  0.        ,   1.        ,   5.        ,  15.5241747 ],
       [  1.        ,   0.        ,   4.47213595,  14.56021978],
       [  5.        ,   4.47213595,   0.        ,  12.        ],
       [ 15.5241747 ,  14.56021978,  12.        ,   0.        ]])

Distance between point i,j is stored in dist[i, j]:

>>> dist[2, 0]
5.0
>>> np.linalg.norm(points[2] - points[0])
5.0

Indices to condensed index

One can convert indices used for accessing the elements of the square matrix to the index in the condensed matrix:

def square_to_condensed(i, j, n):
    assert i != j, "no diagonal elements in condensed matrix"
    if i < j:
        i, j = j, i
    return n*j - j*(j+1)//2 + i - 1 - j

Example:

>>> square_to_condensed(1, 2, len(points))
3
>>> dist_condensed[3]
4.4721359549995796
>>> dist[1,2]
4.4721359549995796

Condensed index to indices

Also the other direction is possible without sqaureform which is better in terms of runtime and memory consumption:

import math

def calc_row_idx(k, n):
    return int(math.ceil((1/2.) * (- (-8*k + 4 *n**2 -4*n - 7)**0.5 + 2*n -1) - 1))

def elem_in_i_rows(i, n):
    return i * (n - 1 - i) + (i*(i + 1))//2

def calc_col_idx(k, i, n):
    return int(n - elem_in_i_rows(i + 1, n) + k)

def condensed_to_square(k, n):
    i = calc_row_idx(k, n)
    j = calc_col_idx(k, i, n)
    return i, j

Example:

>>> condensed_to_square(3, 4)
(1.0, 2.0)

Runtime comparison with squareform

>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> points = np.random.random((10**4,3))
>>> %timeit dist_condensed = pdist(points)
1 loops, best of 3: 555 ms per loop

Creating the sqaureform turns out to be really slow:

>>> dist_condensed = pdist(points)
>>> %timeit dist = squareform(dist_condensed)
1 loops, best of 3: 2.25 s per loop

If we are searching two points with maximum distance it is not surprising that searching in full matrix is O(n) while in condensed form only O(n/2):

>>> dist = squareform(dist_condensed)
>>> %timeit dist_condensed.argmax()
10 loops, best of 3: 45.2 ms per loop
>>> %timeit dist.argmax()
10 loops, best of 3: 93.3 ms per loop

Getting the inideces for the two points takes almost no time in both cases but of course there is some overhead for calculating the condensed index:

>>> idx_flat = dist.argmax()
>>> idx_condensed = dist.argmax()
>>> %timeit list(np.unravel_index(idx_flat, dist.shape))
100000 loops, best of 3: 2.28 µs per loop
>>> %timeit condensed_to_square(idx_condensed, len(points))
100000 loops, best of 3: 14.7 µs per loop
lumbric
  • 7,644
  • 7
  • 42
  • 53
  • 2
    I have a distance matrix in square form -- is there a function to convert to condensed form? (i.e. does the opposite of `squareform`) ... Functions like [linkage](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html) expect condensed form... **EDIT** The `squareform` function will go both ways ... *Way* cool... ["... **and vice-versa.**"](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.squareform.html) **EDIT 2** I called "square-form" should be "redundant" **EDIT 3** and linkage will work with *both* forms... – Nate Anderson Jul 24 '16 at 20:25
  • 1
    @The Red Pea squareform is its own inverse (i.e. when run on a complete distance matrix, it converts it to squareform) – moustachio Oct 19 '16 at 21:16
  • 1
    How did you figure out `calc_row_idx` and `calc_col_idx`? – CMCDragonkai Jun 06 '18 at 06:07
  • @CMCDragonkai if I recall correctly the idea is to take the formula from `square_to_condensed()` and to solve by i or j by using the quadratic formula. The calculation is a bit cumbersome as far as I remember, but it involves only basic algebra and some reasoning which solution is impossible (because it is negative or complex). Drawing an example triangle matrix with empty diagonal helps a bit. – lumbric Jun 09 '18 at 09:53
  • 1
    In Python 3, "Indices to condensed index" should use integer division `//` instead of floating-point division `/`, so that the output is an integer. – mic Apr 17 '20 at 16:37
  • @mic thanks for pointing that out! I've updated the answer. I have to admit, I was still using Python 2.x in 2016 in my day job. – lumbric Apr 19 '20 at 11:56
18

The vector of the compressed matrix corresponds to the bottom triangular region of the square matrix. To convert for a point in that triangular region, you need to calculate the number of points to the left in the triangle, and the number above in the column.

You can use the following function to convert:

q = lambda i,j,n: n*j - j*(j+1)/2 + i - 1 - j

Check:

import numpy as np
from scipy.spatial.distance import pdist, squareform
x = np.random.uniform( size = 100 ).reshape( ( 50, 2 ) )
d = pdist( x )
ds = squareform( d )
for i in xrange( 1, 50 ):
    for j in xrange( i ):
        assert ds[ i, j ] == d[ q( i, j, 50 ) ]
shaunc
  • 5,317
  • 4
  • 43
  • 58
  • 2
    Note that it really is the *bottom* triangular region, which might be strange to some. – shaunc Apr 25 '14 at 16:22
  • 2
    The bottom triangle is the transpose of the upper triangle because the distance matrix is symemtric, i.e. swapping j,i -> i,j gives identical results. Your solution uses the lower triangle interpretation, but there's nothing incorrect about the upper triangle version (which I think is the more common way people think about this) – David Marx Oct 30 '14 at 22:10
  • I'm trying to move in the opposite direction: given an index into the condensed distance matrix (i.e. the flat vector), how can I get the matrix indexes (i,j) that correspond to that value without coercing it to the square form? – David Marx Oct 30 '14 at 22:13
  • squareform serialises the distance vector by rows. this means the smaller index is always the first: (0,1),(0,2),(0,3).... –  Nov 18 '17 at 11:24
9

I had the same question. And I found it simpler to use numpy.triu_indices:

import numpy as np
from scipy.spatial.distance import pdist, squareform
N = 10

# Calculate distances
X = np.random.random((N,3))
dist_condensed = pdist(X)

# Get indexes: matrix indices of dist_condensed[i] are [a[i],b[i]]
a,b = np.triu_indices(N,k=1)

# Fill distance matrix
dist_matrix = np.zeros((N,N))
for i in range(len(dist_condensed)):
    dist_matrix[a[i],b[i]] = dist_condensed[i]
    dist_matrix[b[i],a[i]] = dist_condensed[i]

# Compare with squareform output
np.all(dist_matrix == squareform(dist_condensed))
Rustam Guliev
  • 936
  • 10
  • 15
6

This is the upper triangle version (i < j), which must be interesting to some:

condensed_idx = lambda i,j,n: i*n + j - i*(i+1)/2 - i - 1

This is very easy to understand:

  1. with i*n + j you go to the position in the square-formed matrix;
  2. with - i*(i+1)/2 you remove lower triangle (including diagonal) in all lines before i;
  3. with - i you remove positions in line i before the diagonal;
  4. with - 1 you remove positions in line i on the diagonal.

Check:

import scipy
from scipy.spatial.distance import pdist, squareform
condensed_idx = lambda i,j,n: i*n + j - i*(i+1)/2 - i - 1
n = 50
dim = 2
x = scipy.random.uniform(size = n*dim).reshape((n, dim))
d = pdist(x)
ds = squareform(d)
for i in xrange(1, n-1):
    for j in xrange(i+1, n):
        assert ds[i, j] == d[condensed_idx(i, j, n)]
HongboZhu
  • 4,442
  • 3
  • 27
  • 33
5

If someone is looking for an inverse transformation (i.e. given an element index idx, figure out which (i, j) element corresponds to it), here is a resonably vectored solution:

def actual_indices(idx, n):
    n_row_elems = np.cumsum(np.arange(1, n)[::-1])
    ii = (n_row_elems[:, None] - 1 < idx[None, :]).sum(axis=0)
    shifts = np.concatenate([[0], n_row_elems])
    jj = np.arange(1, n)[ii] + idx - shifts[ii]
    return ii, jj

n = 5
k = 10
idx = np.random.randint(0, n, k)
a = pdist(np.random.rand(n, n))
b = squareform(a)

ii, jj = actual_indices(idx, n)]
assert np.allclose(b[ii, jj, a[idx])

I used it to figure out indexes of closest rows in a matrix.

m = 3  # how many closest
lowest_dist_idx = np.argpartition(-a, -m)[-m:]
ii, jj = actual_indices(lowest_dist_idx, n)  # rows ii[0] and jj[0] are closest
Ben Usman
  • 7,969
  • 6
  • 46
  • 66
4

If you want to access the element of pdist corresponding to the (i,j)-th element of the square distance matrix, the math is as follows: Assume i < j (otherwise flip indices) if i == j, the answer is 0.

X = random((N,m))
dist_matrix = pdist(X)

Then the (i,j)-th element is dist_matrix[ind] where

ind = (N - array(range(1,i+1))).sum()  + (j - 1 - i).
lumbric
  • 7,644
  • 7
  • 42
  • 53
user1462620
  • 288
  • 2
  • 9
  • 1
    Note that `array(range(1,i+1))).sum() == ((i+1)*i)/2` (google for "Young Gauss"). – lumbric Apr 26 '16 at 12:57
  • If I have an array `array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])` with N = 5, your formula at row i = 3 and column j = 2 `(5 - np.array(range(1,3+1))).sum() + (2 - 1 - 3)` would give me 7, and should give me 5. – Veronica Jan 08 '21 at 21:46