2

I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That required nested loops, but the arrays are too large to construct a distance matrix (400G entries!)

I thought the way would be to check only a limited set of B coordinates against each A coordinates. However, I haven't determined an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the same task I'm trying to avoid)?

EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.

Community
  • 1
  • 1
Tristan Klassen
  • 395
  • 1
  • 3
  • 9
  • You could sort each record into grid areas and only check the relevant grid and the surrounding ones - It depends if you have a maximum distance? If not, you're going to have to compare them all. – Basic Aug 01 '12 at 18:40
  • Yes, there is a maximum distance (but see edit to question). – Tristan Klassen Aug 01 '12 at 19:40
  • I solved it by splitting into latitude bins and checking each bin against itself and adjacent bins. I'd thought it would be complicated because I'd assumed sorting by longitude first (or only), but sorting only by latitude provided sufficient performance improvement. – Tristan Klassen Aug 09 '12 at 21:50
  • Glad you found a solution. If you don't mind, please post it as an answer and accept it so a) others can find your answer and b) nobody else tries to answer this question when you've solved the problem. Welcome to SO and hope we see you around again :) – Basic Aug 09 '12 at 23:15

2 Answers2

1

I cannot give a full answer right now, but some hints to get you started. It will be much more efficient to organise the points in B in a kd-tree. You can use the class scipy.spatial.KDTree to do this easily, and you can use the query() method on this class to request the points within a given distance.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
0

Here is one possible implementation of the cross match between list of points on the sphere using k-d tree. http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py

Another way is to use healpy module and their get_neighbors method.

sega_sai
  • 8,328
  • 1
  • 29
  • 38