I want to compute the squared Euclidean distance in a fast way, as described here:
What is the fastest way to compute an RBF kernel in python?
Note 1: I am only interested in the distance, not the RBF kernel.
Note 2: I am neglecting numexpr here and only use numpy directly.
In short, I compute:
|| x - y ||^2 = ||x||^2 + ||y||^2 - 2. * (x @ y.T)
With this I am able to compute the distance matrix faster by a factor of ~10 compared to scipy.pdist. However, I observe numerical issues, which get worse when I take the square root to obtain the Euclidean distance. I get values on the order of 1e-8 to 1e-7 that should be exactly zero (i.e. duplicated points, or the distance of a point to itself).
Question:
Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much of the evaluation speed)? Or are these numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?
Example:
This is a small code example to show the numerical issues (not the speed-ups; for those, please see the answers in the linked SO thread above).
import numpy as np
M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T
unique = np.unique(np.diag(res)) # analytically all diag values are exactly zero
sqrt_unique = np.sqrt(unique)
print(unique)
print(sqrt_unique)
Example output:
[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
0.00000000e+00 4.44089210e-16 8.88178420e-16 1.77635684e-15
3.55271368e-15]
[ nan nan nan nan
0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
5.96046448e-08]
As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch -- but the small positive values carry a large error for the Euclidean case (e.g. abs_error=5.96046448e-08).
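For reference, this is a minimal sketch of the "easy to catch" part I mean: clamping the tiny negative values to zero (an extra np.maximum pass, my own workaround, not something scipy does) removes the nan but of course does not fix the absolute error of the small positive values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((1000, 10))

M_norm = np.sum(M**2, axis=1)
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2.0 * M @ M.T

# Tiny negative entries are pure floating-point noise; clamp them to
# zero before the square root so that no nan appears.
dist = np.sqrt(np.maximum(res, 0.0))

print(np.isnan(dist).any())          # no nan anymore
print(np.abs(np.diag(dist)).max())   # residual error on the diagonal remains
```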