10

For a set of observations:

[a1,a2,a3,a4,a5]

their pairwise distances

d=[[0,a12,a13,a14,a15]
   [a21,0,a23,a24,a25]
   [a31,a32,0,a34,a35]
   [a41,a42,a43,0,a45]
   [a51,a52,a53,a54,0]]

are given in condensed matrix form (the upper triangle of the above, as computed by scipy.spatial.distance.pdist):

c=[a12,a13,a14,a15,a23,a24,a25,a34,a35,a45]

The question is: given an index into the condensed matrix, is there a function f (preferably in Python) that quickly returns which two observations were used to calculate that entry?

f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...

I have tried some solutions but none worth mentioning :(

Ηλίας

7 Answers

28

The formula for an index of the condensed matrix is

index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + j - i - 1

where i is the row index, j is the column index (both zero-based, with i < j), and d is the row length of the original (d x d) upper triangular matrix.
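As a quick sanity check of the formula, here is a minimal sketch. The helper name `condensed_index` is made up for illustration, and it assumes zero-based indices with i < j. It relies on the fact that `itertools.combinations(range(d), 2)` enumerates the pairs row by row, which is the same order as the condensed matrix.

```python
from itertools import combinations

def condensed_index(d, i, j):
    """Hypothetical helper: condensed position of entry (i, j), 0-based, i < j < d."""
    # Entries stored in rows 0..i-1, plus the offset of column j within row i.
    return d * (d - 1) // 2 - (d - i) * (d - i - 1) // 2 + j - i - 1

d = 5
pairs = list(combinations(range(d), 2))   # (0, 1), (0, 2), ..., (3, 4)
indices = [condensed_index(d, i, j) for i, j in pairs]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Integer division is safe here because both products are products of consecutive integers and therefore even.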

Consider the case when the index refers to the leftmost, non-zero entry of some row in the original matrix. For all the leftmost indices,

j == i + 1

so

index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + i + 1 - i - 1
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2

With some algebra, we can rewrite this as

i ** 2 + (1 - (2 * d)) * i + 2 * index == 0

Then we can use the quadratic formula to find the roots of the equation. Both roots are non-negative here; the one we care about is the smaller root, (-b - sqrt(b ** 2 - 8 * index)) / 2 with b = 1 - 2 * d, because it is the only one that falls within the valid row range.

If the index does correspond to the leftmost non-zero cell, that root is a non-negative integer equal to the row number. Then, finding the column number is just arithmetic.

j = index - d * (d - 1) / 2 + (d - i) * (d - i - 1)/ 2 + i + 1

If the index does not correspond to the leftmost non-zero cell, the root is not an integer, but we can take its floor as the row number.

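import math

def row_col_from_condensed_index(d, index):
    b = 1 - (2 * d)
    # floor of the smaller root of i**2 + b*i + 2*index == 0 (zero-based row)
    i = int((-b - math.sqrt(b ** 2 - 8 * index)) // 2)
    # zero-based column, from the index formula solved for j
    j = index + i * (b + i + 2) // 2 + 1
    return (i, j)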

If you don't know d, you can work it out from the length of the condensed matrix.

((d - 1) * d) / 2 == len(condensed_matrix)
d = int((1 + math.sqrt(1 + 8 * len(condensed_matrix))) // 2)
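For very large d, floating-point square roots can in principle round the wrong way. The following is a sketch of an integer-only variant of the same idea, not part of the original answer: the name `pair_from_condensed` is hypothetical, and it assumes Python 3.8+ for `math.isqrt`. It takes the ceiling of the square root so that the floor of the smaller root comes out exactly.

```python
import math

def pair_from_condensed(d, index):
    """Hypothetical helper: 0-based (row, col) for a condensed index, integers only."""
    b = 1 - 2 * d
    disc = b * b - 8 * index              # discriminant; at least 9 for a valid index
    ceil_sqrt = math.isqrt(disc - 1) + 1  # ceiling of sqrt(disc), computed exactly
    i = (-b - ceil_sqrt) // 2             # floor of the smaller root = row
    j = index + i * (b + i + 2) // 2 + 1  # column, from the index formula
    return i, j

print([pair_from_condensed(5, k) for k in (0, 5, 9)])  # [(0, 1), (1, 3), (3, 4)]
```

Adding 1 to each component gives the one-based pairs from the question: (1, 2), (2, 4), (4, 5).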
fgregg
  • 4
    I had to search a long time to find this. Your answer deserves more attention. PS: if you swap out `math` for `numpy`, your solution is actually vectorized. – David Marx Oct 31 '14 at 01:52
  • 2
    I think the problem may be that the question title is not so clear. Do you have a suggestion for a better title? – fgregg Oct 31 '14 at 15:48
4

You may find numpy's triu_indices useful. For example:

In []: ti= triu_indices(5, 1)
In []: r, c= ti[0][5], ti[1][5]
In []: r, c
Out[]: (1, 3)

Note that the indices start from 0. You may adjust them as you like, for example:

In []: def f(n, c):
   ..:     n= int(ceil(sqrt(2* n)))
   ..:     ti= triu_indices(n, 1)
   ..:     return ti[0][c]+ 1, ti[1][c]+ 1
   ..:
In []: f(len(c), 5)
Out[]: (2, 4)
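To illustrate how triu_indices lines up with the condensed order, here is a small sketch that builds the condensed distances by hand with NumPy and maps the position of the minimum back to a pair of observations. The sample points and variable names are made up for illustration, and the result is zero-based.

```python
import numpy as np

# Made-up sample data: four 2-D observations.
points = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0], [10.0, 10.0]])
n = len(points)

rows, cols = np.triu_indices(n, 1)   # 0-based (row, col) pairs in condensed order
# Euclidean distances in the same row-by-row order that pdist uses.
condensed = np.linalg.norm(points[rows] - points[cols], axis=1)

k = int(np.argmin(condensed))        # condensed index of the smallest distance
closest = (int(rows[k]), int(cols[k]))
print(k, closest)  # 1 (0, 2)
```

As the comments below point out, the two index arrays are each as long as the condensed matrix itself, so this trades memory for simplicity.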
eat
  • 1
    This works, although it won't scale up. More than 10k of 2 dimensional observations will fill up the memory – Ηλίας Mar 16 '11 at 11:11
  • @Ηλίας: Care to elaborate more, assuming your condensed matrix data type is double, then triu_indices consume same amount of memory. – eat Mar 16 '11 at 12:01
  • @eat `from scipy.spatial.distance import pdist`, the `pdist` would happily crunch up to 10k of data. And your function would go up to 10.000.000 size. So I take back my comment! The problem was on pdist – Ηλίας Mar 16 '11 at 14:53
  • @Ηλίας: You may describe on a separate question what you are aiming for. Is it absolutely necessary to calculate all pairwise distances? Thanks – eat Mar 16 '11 at 15:51
  • @eat the goal is to quickly find the pair that has the min distance http://stackoverflow.com/q/5119644/188368 – Ηλίας Mar 17 '11 at 10:27
  • 3
    No doubt that this solution is inefficient for even moderate 'n' sizes. – Developer Dec 26 '12 at 10:40
2

To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:

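import numpy as np

def vec_row_col(d, i):
    i = np.asarray(i)
    b = 1 - 2 * d
    # row: floored smaller root of x**2 + b*x + 2*i == 0
    x = np.floor((-b - np.sqrt(b**2 - 8*i)) / 2).astype(int)
    # column: the index formula solved for j
    y = (i + x*(b + x + 2)//2 + 1).astype(int)
    if i.shape:
        return list(zip(x, y))  # array input: list of (row, col) pairs
    return (x, y)  # scalar input: a single pair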

I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):

import numpy as np
from scipy.spatial import distance

test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)

%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop

res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop

That's about 37 times faster in this example!

seralouk
chris-sc
  • There's a syntax error with an extra `(`. Also why would `i` have a shape? The condensed distance matrix is always a 1d array. – CMCDragonkai Jun 06 '18 at 06:20
  • **AMAZING ANSWER**. thanks. I only modified it to `return zip(x,y)` so that I get the output in a list – seralouk Jun 18 '20 at 09:26
2

Clearly, the function f you are looking for needs a second argument: the dimension of the matrix (5 in your case).

First Try:

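def f(dim, i):
  d = dim - 1   # number of entries in the current row
  s = d         # condensed index just past the current row
  while i >= s:
    d -= 1
    s += d
  row = dim - d  # one-based row
  return (row, row + i - s + d + 1)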
phynfo
0

This is in addition to the answer provided by phynfo and your comment. It does not feel like a clean design to me to infer the dimension of the matrix from the length of the compressed matrix. That said, here is how you can compute it:

from math import sqrt, ceil

for i in range(1, 10):
    thelen = (i * (i + 1)) // 2
    # 2*thelen + ceil(sqrt(2*thelen)) equals the perfect square (i + 1)**2
    thedim = round(sqrt(2 * thelen + ceil(sqrt(2 * thelen))))
    print("compressed array of length %d has dimension %d" % (thelen, thedim))

The argument to the outer square root should always be a perfect square, but sqrt returns a floating point number, so some care is needed when using this.
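One way to sidestep the floating-point concern is to stay in integers. The sketch below is an assumption-laden illustration rather than part of the answer above: the name `dimension_from_length` is hypothetical, and `math.isqrt` requires Python 3.8+. It solves d * (d - 1) / 2 == length exactly and rejects lengths that cannot come from a condensed matrix.

```python
import math

def dimension_from_length(n_condensed):
    """Hypothetical helper: the d for which d * (d - 1) / 2 == n_condensed."""
    target = 1 + 8 * n_condensed      # equals (2*d - 1)**2 for a valid length
    root = math.isqrt(target)         # exact integer square root
    if root * root != target:
        raise ValueError("%d is not a valid condensed length" % n_condensed)
    return (1 + root) // 2

print(dimension_from_length(10))  # 5
```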

micans
-1

To improve efficiency when using numpy.triu_indices, use this:

import numpy

def PdistIndices(n, I):
    '''idx = {} indices for pdist results'''
    # row k of the transposed index array is the (i, j) pair for condensed index k
    idx = numpy.array(numpy.triu_indices(n, 1)).T[I]
    return idx

So I is an array of indices.

However, a better solution is to implement an optimized brute-force search, say, in Fortran:

function PdistIndices(n,indices,m) result(IJ)
    !IJ = {} indices for pdist[python] selected results[indices]
    implicit none
    integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
    logical:: finished
    k = 0; w = 0; finished = .false.
    do i=0,n-2
        do j=i+1,n-1
            if (k==indices(w)) then
                IJ(w,:) = [i,j]
                w = w+1
                if (w==m) then
                    finished = .true.
                    exit
                endif
            endif
            k = k+1
        enddo
        if (finished) then
            exit
        endif
    enddo
end function

Then compile using F2PY and enjoy unbeatable performance. ;)

Developer
-1

Here's another solution:

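import numpy as np
from scipy.spatial.distance import squareform

def f(c, n):
    # mark the n-th condensed entry, expand to square form, and read off its position
    tt = np.zeros_like(c)
    tt[n] = 1
    return tuple(np.nonzero(squareform(tt))[0])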
JoshAdel