
Hi everyone, I am trying to write code (using Python 2) that returns a matrix containing the distance between all pairs of rows. Below is an implementation I have written. It works as expected, but it gets very slow as the number of rows grows, so I was wondering if anyone has suggestions on how to make the code more efficient for a large number of rows.

Thanks in advance

import numpy as np

def gendist(x, alpha=2):
    (n, p) = x.shape
    # number of unordered row pairs: n*(n-1)/2
    npairs = 0
    for ii in range(1, n):
        npairs = npairs + ii
    d = np.empty((npairs, p))
    ind = 0
    for ii in range(0, n):
        for jj in range(1, n):
            if ii < jj:
                # element-wise difference of the two rows, raised to alpha
                d[ind, ] = (x[ii, ] - x[jj, ])**alpha
                ind = ind + 1
    return d
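
For context, a minimal example of how I call the function (the input here is just a small random array for illustration):

import numpy as np

x = np.random.rand(5, 3)    # 5 rows, 3 columns
d = gendist(x, alpha=2)     # element-wise powered differences for each row pair
print(d.shape)              # (10, 3), i.e. n*(n-1)/2 pairs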

3 Answers


I see you use x.shape, so I think it is fine to assume that you are using NumPy.

Code:

#!/usr/bin/env python3
import numpy as np
import scipy.spatial.distance as dist

a = np.random.randint(0, 10, (5, 3))
b = dist.pdist(a)
print('Matrix:')
print(a)
print('Pdist')
for d in b:
    print(d)

Output:

Matrix:
[[4 7 6]
 [8 2 8]
 [8 3 5]
 [2 4 7]
 [0 7 5]]
Pdist
6.7082039325
5.74456264654
3.74165738677
4.12310562562
3.16227766017
6.40312423743
9.89949493661
6.40312423743
8.94427191
4.12310562562

where the order of combinations is (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).

The default metric is the Euclidean distance. See pdist to apply other metrics.
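
For example (a small sketch building on the a and b above; squareform and the metric argument are standard scipy.spatial.distance features):

m = dist.squareform(b)                 # expand the condensed vector into a symmetric 5 x 5 matrix
print(m)

c = dist.pdist(a, metric='cityblock')  # same pairs, Manhattan distance instead of Euclidean
print(c)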

Jeon

Without scipy (it is possible to get numpy without scipy, for instance with an Abaqus install) it's a bit more difficult.

import numpy as np

def gendist(x, alpha=2):
    n = x.shape[0]
    xCopies = x.repeat(n, axis=0).reshape(np.concatenate(([n], x.shape)))
    # n x n x p matrix filled with copies of x
    xVecs = xCopies - xCopies.swapaxes(0, 1)  # matrix of difference vectors
    xDists = np.sum(xVecs**alpha, axis=-1)**(1./alpha)  # n x n matrix of distances (1./alpha avoids integer division on Python 2)
    return xDists

That should be robust, at least it's what I had to use.
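
A quick usage sketch of the function above (my own illustrative input, not from the original post):

import numpy as np

a = np.random.randint(0, 10, (5, 3))
D = gendist(a)               # 5 x 5 matrix of Euclidean distances (alpha=2)
print(D)
print(np.allclose(D, D.T))   # True: the matrix is symmetric with a zero diagonal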

Daniel F

I think what you're looking for is sklearn pairwise_distances. scipy distance_matrix takes ~115 sec on my machine to compute a 10Kx10K distance matrix on 512-dimensional vectors. scipy cdist takes ~50 sec. sklearn pairwise_distances takes ~9 sec. From the documentation:

Note that in the case of ‘cityblock’, ‘cosine’ and ‘euclidean’ (which are valid scipy.spatial.distance metrics), the scikit-learn implementation will be used, which is faster and has support for sparse matrices (except for ‘cityblock’).
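
A minimal sketch of the sklearn call (the array sizes and parameters here are my own illustration):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(1000, 512)
D = pairwise_distances(X, metric='euclidean', n_jobs=-1)  # full 1000 x 1000 distance matrix, parallelised over cores
print(D.shape)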

Kai