
I'm a bit stumped by how scipy.spatial.distance.pdist handles missing (nan) values.

So just in case I messed up the dimensions of my matrix, let's get that out of the way. From the docs:

The points are arranged as m n-dimensional row vectors in the matrix X.

So let's generate three points in 10 dimensional space with missing values:

import numpy
from scipy.spatial.distance import pdist

numpy.random.seed(123456789)
data = numpy.random.rand(3, 10) * 5
data[data < 1.0] = numpy.nan

If I compute the Euclidean distance of these three observations:

pdist(data, "euclidean")

I get:

array([ nan,  nan,  nan])

However, if I filter out all the columns that contain missing values, I do get proper distance values:

valid = [i for (i, col) in enumerate(data.T) if ~numpy.isnan(col).any()]
pdist(data[:, valid], "euclidean")

I get:

array([ 3.35518662,  2.35481185,  3.10323893])

This way, I throw away more data than I'd like, since I don't need to filter the whole matrix, only the pair of vectors being compared at any one time. Can I make pdist or a similar function perform pairwise masking, somehow?


Edit:

Since my full matrix is rather large, I did some timing tests on the small data set provided here.

1.) The scipy function.

%timeit pdist(data, "euclidean")
10000 loops, best of 3: 24.4 µs per loop

2.) Unfortunately, the solution provided so far is roughly 10 times slower.

%timeit numpy.array([pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()
1000 loops, best of 3: 231 µs per loop

3.) Then I did a test of "pure" Python and was pleasantly surprised:

from scipy.linalg import norm

%%timeit
m = data.shape[0]
dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
mask = numpy.isfinite(data)
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        # keep only the dimensions that are finite in both vectors
        curr = numpy.logical_and(mask[i], mask[j])
        u = data[i][curr]
        v = data[j][curr]
        dm[k] = norm(u - v)
        k += 1
10000 loops, best of 3: 98.9 µs per loop

So I think the way forward is to Cythonize the above code as a function.
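As a starting point, here is the same loop wrapped into a plain Python function (masked_pdist is just a placeholder name of mine), which is roughly what I would then hand to Cython:

def masked_pdist(data):
    """Condensed distance matrix, ignoring dimensions with nan in either vector."""
    m = data.shape[0]
    dm = numpy.zeros(m * (m - 1) // 2, dtype=float)
    mask = numpy.isfinite(data)
    k = 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            # keep only the dimensions that are finite in both vectors
            curr = numpy.logical_and(mask[i], mask[j])
            dm[k] = norm(data[i, curr] - data[j, curr])
            k += 1
    return dm

masked_pdist(data) returns the same condensed layout as pdist, i.e. a length-3 array for the three observations above.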

Midnighter
  • If you Cythonize that, wouldn't it probably look nicer built around masked arrays directly? Also be careful about extrapolating performance from small samples to larger samples... – deinonychusaur Jul 17 '14 at 12:58
  • Sorry for my worst-grammared comment so far. Running on 1000 points in 100-dimensional space still makes the Python solution faster (2.65x). – deinonychusaur Jul 17 '14 at 13:31
  • @deinonychusaur What do you mean by "building it around masked arrays directly"? You still need to use the mask supplied to an `ma.array` if you want to hide data, right? Maybe you could edit your answer, or we could have a little chat. – Midnighter Jul 17 '14 at 14:21
  • Exactly so, but the Python code will look neater. I dug around https://github.com/scipy/scipy/blob/v0.13.0/scipy/spatial/src/distance_wrap.c#L43 too, don't know if that would be of assistance. Could chat, but don't know if I have much more to offer. – deinonychusaur Jul 17 '14 at 14:30
  • @Midnighter, did you get any further speed-up on your code? – Gökhan Sever Apr 13 '16 at 16:17
  • I think another way to speed up this comparison is to use the multiprocessing module and let different cores work on different chunks of the data. – Gökhan Sever Apr 13 '16 at 16:46

2 Answers


If I understand you correctly, you want the distance computed over all the dimensions for which both vectors have valid values.

Unfortunately, pdist doesn't understand masked arrays in that sense, so I modified your semi-solution so that it does not reduce information. It is, however, neither the most efficient nor the most readable solution:

import itertools

numpy.array([
    pdist(data[s][:, ~numpy.isnan(data[s]).any(axis=0)], "euclidean")
    for s in map(list, itertools.combinations(range(data.shape[0]), 2))
]).ravel()

The outer conversion to an array and the final ravel are just there to get the result into the shape you would expect.

itertools.combinations produces all possible pairs of row indices into the data array.

I then slice data with each pair of indices (the pair must be a list and not a tuple in order to slice correctly) and do the pairwise filtering of nan values just as your code did.
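To make the slicing step concrete, here is one pair unrolled step by step (just an illustration, not part of anything timed):

s = [0, 2]                             # one index pair from itertools.combinations
pair = data[s]                         # shape (2, 10): the two rows being compared
keep = ~numpy.isnan(pair).any(axis=0)  # columns where neither row has a nan
pdist(pair[:, keep], "euclidean")      # distance between rows 0 and 2 only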

deinonychusaur

Actually, you might be better off with this ready-made solution: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html

However, the drawback seems to be that it is trickier to apply weights when there are missing values.
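A minimal usage sketch (assuming scikit-learn >= 0.22 and the data array from the question); note that this function rescales each distance according to how many coordinates are actually present, so the values won't exactly match a plain masked norm:

from sklearn.metrics.pairwise import nan_euclidean_distances
from scipy.spatial.distance import squareform

square = nan_euclidean_distances(data)  # (3, 3) symmetric matrix, nan-aware
squareform(square, checks=False)        # condensed form like pdist's output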

George Pligoropoulos
  • That would have been a great solution indeed, but it was only added in December 2019 while the question is a few years old by now. I will accept your answer because it is the function I'd use today. – Midnighter Aug 02 '20 at 18:49