
I have to calculate the pairwise weighted distance of a potentially large number of vectors. Let A be the m vectors / n attributes matrix, and W a 1xn weight vector.

My first function calculates the pairwise distance the naive way:

import numpy as np

def weightedDistance(A, W):
    aff = np.zeros((len(A), len(A)))
    for i in range(len(A)):
        for j in range(len(A)):
            aff[i, j] = np.sum(((A[i] - A[j]) ** 2) * W)
    return aff

This takes over twenty seconds for a 1000x2 matrix.
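(As an aside, the same arithmetic can be vectorized with broadcasting, which keeps the loop's exact summation formula while avoiding the Python-level loops. The sketch below is not from the question; `weightedDistanceBroadcast` is a hypothetical name, and the approach trades the speedup for O(m²·n) temporary memory.)

```python
import numpy as np

def weightedDistanceBroadcast(A, W):
    # Same formula as the double loop, computed in one shot:
    # diff[i, j, k] = A[i, k] - A[j, k], shape (m, m, n)
    diff = A[:, None, :] - A[None, :, :]
    # Weight each squared coordinate difference, then sum over attributes.
    return np.sum((diff ** 2) * W, axis=-1)
```

Because no expansion of the square is involved, this should agree with the naive loop far more closely than the matrix-multiplication trick, at the cost of materializing the (m, m, n) difference array.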

Then I tried a faster version, using matrix multiplication, based on this blog post:

def weightedDistanceFast(A, W):
    AA = np.sum((A ** 2) * W, axis=1)
    A2 = -2 * ((A * W).dot(A.T))
    return (AA + A2).T + AA

This is "instant" for the same 1000x2 matrix. However, it introduces a bunch of errors at the order of e-17.

>>> dist = desc.weightedDistanceFast(attr,(1,1))
>>> dist2 = desc.weightedDistance(attr,(1,1))
>>> np.sum(dist-dist2)
6.9155863612166586e-11
>>> len(np.where((dist-dist2 > 0))[0])
380824
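(One common way to tame the residual error from the expand-the-square formula is to clean up the result afterwards: squared distances must be symmetric, non-negative, and zero on the diagonal, so those properties can simply be enforced. The sketch below is not from the question; `weightedDistanceFastClean` is a hypothetical name.)

```python
import numpy as np

def weightedDistanceFastClean(A, W):
    # Same expansion as weightedDistanceFast...
    AA = np.sum((A ** 2) * W, axis=1)
    D = (AA - 2 * (A * W).dot(A.T)).T + AA
    # ...plus three cheap post-hoc fixes:
    D = (D + D.T) / 2            # enforce symmetry
    np.fill_diagonal(D, 0.0)     # exact zeros on the diagonal
    return np.maximum(D, 0.0)    # squared distances cannot be negative
```

This does not make the individual entries more accurate than the raw expansion, but it removes the kinds of artifacts (tiny negatives, asymmetry) that tend to break downstream computations such as square roots or clustering.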

Checking other answers around here, I also tried to use the wminkowski distance from the scipy.spatial.distance package, but it yielded different results from the above two functions depending on the value of W.

def weightedDistanceMinkowski(A,W):
    aff = sp.squareform(sp.pdist(A,metric='wminkowski',p=2,w=W))
    return aff**2
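(The wminkowski mismatch is probably not a bug: SciPy's wminkowski applies the weights to the differences *before* raising them to the power p, so with p=2 its squared distance is sum((w*diff)**2) rather than sum(w*diff**2) — which is why it agrees only for weights of 0 and 1. Newer SciPy versions removed 'wminkowski'; the plain 'minkowski' metric takes a w= argument that weights |diff|**p directly, matching the loop formula. A hypothetical fixed variant, assuming a SciPy recent enough to support w= on 'minkowski':)

```python
import numpy as np
import scipy.spatial.distance as sp

def weightedDistanceMinkowskiFixed(A, W):
    # 'minkowski' with w= computes (sum(w * |u - v|**p))**(1/p),
    # so squaring the p=2 result gives exactly sum(W * diff**2).
    # (With the old 'wminkowski' metric, passing w=np.sqrt(W) would
    # have the same effect, since it weights diff before squaring.)
    W = np.asarray(W, dtype=float)
    aff = sp.squareform(sp.pdist(A, metric='minkowski', p=2, w=W))
    return aff ** 2
```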

Anyway, I know that a certain degree of error is inevitable, but since I'm going to do further calculations on this distance matrix, I wonder if there is a way to reduce this error (or, alternatively, a less error-prone way to make function 1 faster).

Thank you!

  • The real question is do you really need that level of precision? If not (likely!) just go for the speed! And if you did need it, then you'll probably have to worry more about the noise in your data than the numerical errors... – Julien Dec 22 '15 at 02:24
  • If necessary, you can increase the precision by changing an array's [data type](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html), e.g., `A = A.astype(np.float64)` or similar. – Reti43 Dec 22 '15 at 02:27
  • What makes you so confident that the first answer is the "correct" one? Both will have rounding errors, but if I had to guess I would expect the second answer to be closer to the true distance value. `np.dot` usually calls BLAS routines, which can make use of things like fused multiply-add operations to reduce the total number of floating point operations required and hence the rounding error. – ali_m Dec 22 '15 at 03:30
  • Thank you all for the comments -- Julien is quite correct in mentioning that if I care about this level of error I might have bigger problems. So I guess my zeroth question should be if these errors really matter, to which I say "I'm not sure yet". Time to think about it. ali_m: as for the correctness, the errors occur even on very small matrices, which I did check by hand. This leads me to believe that it is numpy introducing the errors, but thanks for the comment! I'm still interested in the answer for curiosity's sake. I wonder if what Reti says is all there is to it. – caranha Jan 04 '16 at 02:03

0 Answers