0

I'm looking to find the median pairwise squared euclidean distance of an input array.

I know that I could do it using pdist:

from scipy.spatial.distance import pdist
dists = pdist(inp,'sqeuclidean')
np.median(dists)

But I need to repeat the process many times on very large input arrays. I'm doubtful that this could ever be extremely fast, as pairwise distance calculation is O(n^2) -- unless there's some clever way around that. But I know that the code for pdist isn't particularly efficient (source code is nested for-loops), so I was wondering if there's a way to speed it up, especially considering I just need the median.

Doe a
  • 352
  • 1
  • 11
  • Just trying to understand -- is pairwise operations `O{n^3}` or `O{n^2}`? For each element in `S`, pairwise with each element in `S`. What am I missing? – Razzle Shazl Mar 08 '21 at 18:20
  • Ah, I think you're right. I'll edit my question accordingly. I'm still trying to speed it up if possible, perhaps through a different implementation of the pairwise distance? I have tried using [this function](https://stackoverflow.com/questions/47566072/is-it-possible-to-speed-up-this-loop-in-python/47568504#47568504) to circumvent pdist's loops, but then I have to compute the median over an 2D array, which isn't any faster. – Doe a Mar 08 '21 at 20:03
  • You might achieve gains by exploiting data locality in your code (C let's you allocate along page boundaries, not sure what python let's you do). Combine that with compatible access pattern (e.g. sequential) and you reduce cache miss which is typically a big gain. – Razzle Shazl Mar 08 '21 at 22:42

0 Answers0