Pandas: finding points within maximum distance

Question

I am trying to find pairs of (x,y) points within a maximum distance of each other. I thought the simplest thing to do would be to generate a DataFrame and go through each point, one by one, calculating if there are points with coordinates (x,y) within distance r of the given point (x_0, y_0). Then, divide the total number of discovered pairs by 2.

%pylab inline
import pandas as pd

def find_nbrs(low, high, num, max_d):
    x = random.uniform(low, high, num)
    y = random.uniform(low, high, num)
    points = pd.DataFrame({'x':x, 'y':y})

    tot_nbrs = 0

    for i in arange(len(points)):
        x_0 = points.x[i]
        y_0 = points.y[i]

        pt_nbrz = points[((x_0 - points.x)**2 + (y_0 - points.y)**2) < max_d**2]
        tot_nbrs += len(pt_nbrz)
        plot (pt_nbrz.x, pt_nbrz.y, 'r-')

    plot (points.x, points.y, 'b.')
    return tot_nbrs

print find_nbrs(0, 1, 50, 0.1)

First of all, it's not always finding the right pairs (I see points that are within the stated distance that are not labeled).
If I write plot(..., 'or'), it highlights all the points. Which means that pt_nbrz = points[((x_0 - points.x)**2 + (y_0 - points.y)**2) < max_d**2] returns at least one (x,y). Why? Shouldn't it return an empty array if the comparison is False?
How do I do all of the above more elegantly in Pandas? For example, without having to loop through each element.

Correct me if I'm wrong but you're doing an O(n) search when what I think you want is an O(n^2) search. You're basically checking distance between x0:y0, x1:y1, x2:y2... when what I think you want to do is check x0:y0, x0:y1, ... x1:y0, x1:y1, x1:y2 .... — Greg, Nov 09 '14 at 07:21
But if I'm wrong about what you want then this would work well for you http://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy — Greg, Nov 09 '14 at 07:25
Thanks for the link. I am having some trouble figuring out how to use numpy.linalg.norm to calculate the distance, despite the answer. What format should a and b be in, in the example? Re: O(n^2), I thought that's what I was doing: i.e., going through each dataframe element and finding all other the elements that satisfy the comparison. That should identify all the twins, twice, so to get the number, I just divide the finall tally by 2. — Anarcho-Chossid, Nov 09 '14 at 18:25
Actually, I just realized the answer to question 2: obviously, Pandas compares the distance of the point to itself (which is zero and smaller than max_d), which is why each point is "labeled" as a twin. So, that's another question: how do I perform a comparison that doesn't include comparing the point to itself? — Anarcho-Chossid, Nov 09 '14 at 18:29

score 8 · Answer 1 · answered Nov 11 '14 at 17:50

The functionality you're looking for is included in scipy's spatial distance module.

Here's an example of how you could use it. The real magic is in squareform(pdist(points)).

from scipy.spatial.distance import pdist, squareform
import numpy as np
import matplotlib.pyplot as plt

points = np.random.uniform(-.5, .5, (1000,2))

# Compute the distance between each different pair of points in X with pdist.
# Then, just for ease of working, convert to a typical symmetric distance matrix
# with squareform.
dists = squareform(pdist(points))

poi = points[4] # point of interest
dist_min = .1
close_points = dists[4] < dist_min

print("There are {} other points within a distance of {} from the point "
    "({:.3f}, {:.3f})".format(close_points.sum() - 1, dist_min, *poi))

There are 27 other points within a distance of 0.1 from the point (0.194, 0.160)

For visualization purposes:

f,ax = plt.subplots(subplot_kw=
    dict(aspect='equal', xlim=(-.5, .5), ylim=(-.5, .5)))
ax.plot(points[:,0], points[:,1], 'b+ ')
ax.plot(poi[0], poi[1], ms=15, marker='s', mfc='none', mec='g')
ax.plot(points[close_points,0], points[close_points,1],
    marker='o', mfc='none', mec='r', ls='')  # draw all points within distance

t = np.linspace(0, 2*np.pi, 512)
circle = dist_min*np.vstack([np.cos(t), np.sin(t)]).T
ax.plot((circle+poi)[:,0], (circle+poi)[:,1], 'k:') # Add a visual check for that distance
plt.show()

enter image description here

Pandas: finding points within maximum distance

1 Answers1