3

I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.

The array is stored as a numpy condensed array, and has on the order of 100 million observations.

Here's an example to get the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like.

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of original observations
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.

NB If lots of pairs are equidistant, that's OK and I don't care about their ordering in that case.

roblanf
  • You can at least speed up the sort by using partial sorting. It would be only like six times faster, at most. – leewz Dec 12 '13 at 10:41
  • @leewangzhong, thanks. But unfortunately it's not the sorting that's the rate limiting step. It's converting the list of indices back into pairs of observations. – roblanf Dec 12 '13 at 10:44
  • related: [Millions of 3D points: How to find the 10 of them closest to a given point?](http://stackoverflow.com/q/2486093/4279) – jfs Dec 12 '13 at 15:10
  • Does this answer your question? [Efficient way to take the minimum/maximum n values and indices from a matrix using Python and NumPy](https://stackoverflow.com/questions/5807047/efficient-way-to-take-the-minimum-maximum-n-values-and-indices-from-a-matrix-usi) – EliadL Jan 09 '20 at 13:05

4 Answers

5

You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r (on Python 3, wrap the zip call in list() if you need a list rather than an iterator):

r = zip(ti[0][closest] + 1, ti[1][closest] + 1)

or

r = np.vstack(ti)[:, closest] + 1
YXD
  • Legend. Thanks. I should have noticed the time saving on calculating ti. Some timings on my laptop: my slow method: 2.94 s per loop; your first method: 74.2 ms per loop; with zip for r: 70.7 ms per loop; with vstack for r: 70.5 ms per loop. – roblanf Dec 12 '13 at 11:32
  • Great :) I didn't try it out with 100 million observations but hopefully it does the trick. – YXD Dec 12 '13 at 11:36
2

If you are using numpy 1.8 or later, you can speed up locating the minimum values very noticeably with np.partition:

def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    # indices of the n smallest values of a, ordered by value
    ret = np.argpartition(a, n)[:n]      # unordered indices of the n smallest
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))   # reorder those indices by value

dists = np.random.rand(1000*999//2) # a pdist array

In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True

In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop

In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop

And once you have the indices of the smallest distances, you don't need a loop to extract the pairs; do it in a single shot:

closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
Jaime
  • That's awesome. Looks like you're missing a single closing bracket before the +1 at the end. Also, (not that this matters relative to the other speedups), my timings suggest that using @mr-e 's zip method is marginally quicker than using np.column_stack to make the list of pairs when it's 100 units long. Though I haven't checked how each one scales up. – roblanf Dec 12 '13 at 22:13
0

The best solution probably won't generate all of the distances.

Proposal (a rough code sketch follows the list):

  1. Make a heap of max size 100 (if it grows bigger, reduce it).
  2. Use the Closest Pair algorithm to find the closest pair.
  3. Add the pair to the heap (priority queue).
  4. Choose one of that pair. Add its 99 closest neighbors to the heap.
  5. Remove the chosen point from the list.
  6. Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you ran the Closest Pair algorithm.
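A minimal sketch of this "don't generate all the distances" idea, with the hand-rolled Closest Pair step swapped for scipy.spatial.cKDTree nearest-neighbour queries and heapq.nsmallest as the bounded priority queue. The function closest_pairs and the sample data are illustrative, not part of the proposal:

import heapq
import numpy as np
from scipy.spatial import cKDTree

def closest_pairs(points, N):
    # Find the N closest (distance, i, j) pairs without building
    # the full condensed distance array.
    tree = cKDTree(points)
    # Query each point's N nearest neighbours (k = N + 1 because every
    # point's nearest neighbour is itself); p=1 matches 'cityblock'.
    d, idx = tree.query(points, k=min(N + 1, len(points)), p=1)
    candidates = []
    for i in range(len(points)):
        for dist, j in zip(d[i], idx[i]):
            if i < j:  # keep each unordered pair once, and drop self-matches
                candidates.append((dist, i, j))
    return heapq.nsmallest(N, candidates)

pts = np.random.randint(1, 1000, size=(10000, 1))  # 1-D observations, as in the question
print(closest_pairs(pts, 5))

Any pair among the N globally closest must have its higher-indexed point among the other point's N nearest neighbours (ties aside), so this examines roughly N candidates per observation instead of all ~n²/2 distances.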
leewz
0

You can use a pandas DataFrame. First declare the distance matrix (computed, for instance, with pairwise_distances() from sklearn) as a DataFrame, adding column and index names from the source data. Then select the column of interest by name, sort it with sort_values(), and take the top 5 or top 10. That's it.
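
A minimal sketch of this approach; the names data and labels and the 'manhattan' metric are made-up assumptions for illustration, not from the answer:

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

data = np.random.rand(6, 2)            # six observations, two features
labels = list("abcdef")                # index/column names from the source data
dist = pd.DataFrame(pairwise_distances(data, metric="manhattan"),
                    index=labels, columns=labels)

# distances from observation 'c' to everything else, closest first
closest_to_c = dist["c"].drop("c").sort_values().head(5)
print(closest_to_c)

Note that this ranks the neighbours of one chosen observation rather than returning the globally closest pairs.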

Leo