
I want to compute the squared Euclidean distance in a fast way, as described here:

What is the fastest way to compute an RBF kernel in python?

Note 1: I am only interested in the distance, not the RBF kernel.

Note 2: I am neglecting numexpr here and using only numpy directly.

In short, I compute:

|| x - y ||^2 = ||x||^2 + ||y||^2 - 2. * (x @ y.T)
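
(This is just the expansion of the dot product: ||x - y||^2 = dot(x - y, x - y) = dot(x, x) - 2*dot(x, y) + dot(y, y); the cross terms for all pairs are obtained at once from the matrix product x @ y.T.)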

With this I am able to compute the distance matrix faster by a factor of ~10 compared to scipy.pdist. However, I observe numerical issues, which get worse if I take the square root to obtain the Euclidean distance. I get values on the order of 1e-8 to 1e-7 that should be exactly zero (i.e. for duplicated points or the distance of a point to itself).

Question:

Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much of the evaluation speed)? Or are these numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?

Example:

This is a small code example to show the numerical issues (not the speed-ups; for those, please look at the answers of the linked SO thread above).

import numpy as np

M = np.random.rand(1000, 10)

M_norm = np.sum(M**2, axis=1)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T

unique = np.unique(np.diag(res))  # analytically, all diagonal values are exactly zero
sqrt_unique = np.sqrt(unique)

print(unique)
print(sqrt_unique)

Example output:

[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
  0.00000000e+00  4.44089210e-16  8.88178420e-16  1.77635684e-15
  3.55271368e-15]
[           nan            nan            nan            nan
 0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
 5.96046448e-08]

As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch -- but the small positive values have a large absolute error for the Euclidean case (e.g. abs_error=5.96046448e-08).
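
For illustration, catching the negatives could look roughly like this sketch (clamping removes the nans, but it does not recover the precision lost to cancellation):

import numpy as np

M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T

# clamp tiny negative values produced by cancellation before taking the root
res = np.maximum(res, 0.)
dist = np.sqrt(res)  # no nans, but entries near zero still carry ~1e-8 error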

no_use123
  • why are you doing this? `pdist` is over twice as fast as your code on my machine (4 vs 9 ms) as well as not having these numerical stability issues. even after expanding out with `squareform` it's still faster. – Sam Mason Jan 19 '20 at 11:13
  • As I understand it, floating point arithmetic (used by numpy) will never guarantee you an exact zero. If your machine supports it you can use np.longdouble for more precision (abs_error ~ 1e-10). However, you have not given any indication of what a "large error" is in this context. – Calimocho Jan 19 '20 at 11:14
  • @Sam Mason this is a minimal example to show the numerical issues. The speed-up is just background information on why I am doing it this way. When using `numexpr` with more points and a larger point dimension, the described way is much faster. Please also look at the linked SO question, where they properly look at the speed; I see similar speed-ups, except that I do not compute the `exp`. -- They report pdist = 2 min vs. numpy (this way) = 24 sec. – no_use123 Jan 19 '20 at 11:23
  • @Calimocho yes, that is true, I am aware of the floating point issue. Sometimes it is possible to rearrange things and make it more stable; in this case I couldn't find a way yet. pdist is able to produce exact zeros, because it computes `d = x - x` directly. – no_use123 Jan 19 '20 at 11:43
  • The `np.longdouble` did not work. However, it becomes exact when I cast down (i.e. from `np.float64` to `np.float32`). But this of course is a loss of precision generally. – no_use123 Jan 19 '20 at 11:47
  • @no_use123 sorry, didn't read that page as I know about RBFs and assumed it was less directly related! think you're probably stuck just applying an `abs`. using `longdouble`s will significantly affect performance, which seems to be something you care about. why are you saying `pdist` has "exact zeros"? it never actually calculates the diagonals... – Sam Mason Jan 19 '20 at 11:58
  • @SamMason you're right, I didn't write it precisely. pdist of course does not compute the diagonal; what I meant is that pdist (or cdist) is able to detect point duplicates in the point matrix with an exact zero. – no_use123 Jan 19 '20 at 12:47
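
To make the duplicate-detection point from the comments concrete, here is a small sketch (with a hypothetical 5x10 matrix containing one duplicated row) comparing cdist, which subtracts coordinates directly, with the expanded formula:

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
M = rng.random((5, 10))
M[3] = M[0]  # duplicate a row on purpose

direct = distance.cdist(M, M)  # computes sqrt(sum((x - y)**2)) pairwise

M_norm = np.einsum('ij,ij->i', M, M)
expanded = M_norm[:, None] + M_norm[None, :] - 2. * M @ M.T

print(direct[0, 3])    # exactly 0.0, since x - y is exactly zero element-wise
print(expanded[0, 3])  # typically a tiny nonzero (possibly negative) value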

1 Answer


as per my comment, using `abs` is probably your best option for cleaning up the numerical instability inherent in this algorithm. as you're concerned about performance, you should probably be using the in-place (mutating) assignment operators, as they create less garbage and hence can be much faster. also, when running this with many features (e.g. 10k) I see pdist being slower than this implementation.

putting the above together we get:

import numpy as np

def edist0(M):
    "calculate pairwise euclidean distance"
    M_norm = np.sum(M**2, axis=1)
    res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T
    return np.sqrt(np.abs(res))  # abs cleans up tiny negatives from cancellation

def edist1(M):
    "optimised calculation of pairwise euclidean distance"
    M_norm = np.einsum('ij,ij->i', M, M)  # row-wise squared norms without a temporary
    res = M @ M.T
    # in-place updates avoid allocating extra intermediate matrices
    res *= -2.
    res += M_norm[:, np.newaxis]
    res += M_norm[np.newaxis, :]
    return np.sqrt(np.abs(res, out=res), out=res)

timing this in IPython with:

from scipy.spatial import distance

M = np.random.rand(1000, 10000)

%timeit distance.squareform(distance.pdist(M))
%timeit edist0(M)
%timeit edist1(M)

I get:

2.82 s ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
296 ms ± 6.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
153 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

and no errors/warnings from sqrt

the linked question also points to scikit-learn as having good distance kernel implementations, the euclidean one being pairwise_distances, which benchmarks as:

from sklearn.metrics import pairwise_distances

%timeit pairwise_distances(M)

170 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

which might be nice to use if you're already using that package

Sam Mason
  • I still observe some positive values on the diagonal that should be exactly zero. However, sklearn is able to get it right (exact zeros) and fast, so thank you for pointing out sklearn again, I can work with this. – no_use123 Jan 19 '20 at 12:52
  • scikit learn ensures the diagonal is zero but still gives nonzero values for rows that are the same. i.e. the diagonal being zero is a hack, it's not because it's being careful about numerical stability – Sam Mason Jan 19 '20 at 12:57
  • Ah right, I can confirm this. I guess this is a trade-off between speed and accuracy. I think there is no numerically exact way, so I'll leave your answer green. – no_use123 Jan 19 '20 at 13:09
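
Following up on the last two comments: forcing the diagonal to exact zero, the way scikit-learn reportedly does, can be bolted onto edist1 with np.fill_diagonal. A rough sketch (the name edist2 is just for this illustration; off-diagonal duplicates still carry rounding error):

import numpy as np

def edist2(M):
    "edist1 plus the diagonal forced to exact zero"
    M_norm = np.einsum('ij,ij->i', M, M)
    res = M @ M.T
    res *= -2.
    res += M_norm[:, np.newaxis]
    res += M_norm[np.newaxis, :]
    np.fill_diagonal(res, 0.)  # exact zeros for self-distances only
    return np.sqrt(np.abs(res, out=res), out=res)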