I am working on a data science project in which I have to compute the Euclidean distance between every pair of observations in a dataset.
Since I am working with very large datasets, I need an efficient implementation of the pairwise distance computation, both in terms of memory usage and computation time.
One solution is to use the pdist function from SciPy, which returns the result in a 1D array, without duplicate instances.
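To make the output format concrete, here is a small sketch of what I mean by the 1D result. The four observations are toy values made up for illustration, not my real data; squareform is only used to show how the condensed vector relates to the full matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four observations with two numeric features each (made-up values).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 1.0]])

m = X.shape[0]
condensed = pdist(X, metric="euclidean")

# One entry per unordered pair (i, j) with i < j, in row-major order:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
assert condensed.shape == (m * (m - 1) // 2,)

# squareform expands the condensed vector into the full symmetric matrix.
full = squareform(condensed)
assert full.shape == (m, m)
```

The condensed form stores m * (m - 1) / 2 values instead of m * m, which is why I want to keep it for large inputs.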
However, this function is not able to deal with categorical variables. For these, I want to set the distance to 0 when the values are the same and 1 otherwise.
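As a reference for the distance I have in mind, here is a minimal NumPy sketch for a single pair of observations. The function name mixed_distance and the boolean mask is_categorical are hypothetical names used only for this illustration, and I assume the categorical values are already encoded as numbers.

```python
import numpy as np

def mixed_distance(x, y, is_categorical):
    """Distance between two observations with mixed column types.

    is_categorical is a hypothetical boolean mask: True marks a categorical
    column (contributes 0 if equal, 1 otherwise), False marks a numeric
    column (contributes the squared difference).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mask = np.asarray(is_categorical, dtype=bool)
    diff = x - y
    numeric_part = np.sum(diff[~mask] ** 2)
    categorical_part = np.count_nonzero(x[mask] != y[mask])
    return np.sqrt(numeric_part + categorical_part)

# Two numeric columns followed by one integer-coded categorical column.
a = [0.0, 0.0, 2.0]
b = [3.0, 4.0, 5.0]
mask = [False, False, True]
d = mixed_distance(a, b, mask)  # sqrt(9 + 16 + 1)
```

This is only meant to pin down the definition; it handles one pair at a time and is not the efficient all-pairs version I am looking for.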
I have tried to implement this variant in Python with Numba. The function takes as input the 2D NumPy array containing all the observations and a 1D array containing the types of the variables (either float64 or category).
Here is the code:
import numpy as np
from numba.decorators import autojit

def pairwise(X, types):
    m = X.shape[0]
    n = X.shape[1]
    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float)
    ind = 0
    for i in range(m):
        for j in range(i+1, m):
            d = 0.0
            for k in range(n):
                if types[k] == 'float64':
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.
            D[ind] = np.sqrt(d)
            ind += 1
    return D.reshape(1, -1)[0]

pairwise_numba = autojit(pairwise)

vectors = np.random.rand(20000, 100)
types = np.array(['float64']*100)
dists = pairwise_numba(vectors, types)
This implementation is very slow despite the use of Numba. Is it possible to improve my code to make it faster?