-2

I have a DataFrame of points, each with an id and a latitude/longitude:

df = pd.DataFrame({'id':list('abcde'),'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074]})  #sample

For each id I need to count the number of points (of the same dataset) that are located within a radius of 2 miles from it.

Question: what is the simplest way to do this in Python?

Dima

2 Answers

1

The question is somewhat ambiguous. The first component you need is a function to calculate the distance between two coordinates; this requires some trigonometry and has several implementations in the following questions.

Once you have the function, simply loop over all points and calculate the distance for each pair. There might be more efficient ways than two nested loops, but this is the simplest.
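As a rough illustration of the two steps above, here is a minimal sketch using the haversine formula and two nested loops. The names `haversine_miles` and `count_within` are made up for this example, and the Earth radius of 3958.8 miles is a mean-radius approximation, so treat it as a sketch rather than a reference implementation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_within(latitudes, longitudes, radius_miles):
    """For each point, count the OTHER points within radius_miles (O(n^2))."""
    points = list(zip(latitudes, longitudes))
    counts = []
    for i, (lat_i, lon_i) in enumerate(points):
        n = 0
        for j, (lat_j, lon_j) in enumerate(points):
            if i != j and haversine_miles(lat_i, lon_i, lat_j, lon_j) <= radius_miles:
                n += 1
        counts.append(n)
    return counts


# Sample coordinates from the question
latitudes = [38.470628, 37.994155, 38.66937, 34.119578, 36.292307]
longitudes = [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074]

counts_2_miles = count_within(latitudes, longitudes, radius_miles=2)
counts_50_miles = count_within(latitudes, longitudes, radius_miles=50)
```

With a DataFrame you could pass `df["latitude"]` and `df["longitude"]` and assign the result to a new column. This is fine for a handful of points but will be far too slow for a million rows.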

Gadi
  • Thanks for your reply! I would like to have an easier way, because in fact the frame consists of about 1 million ids – Dima Feb 23 '22 at 13:10
  • I am not very familiar with spatial algorithms, but there are some ways to make it more efficient even if we remain at O(n^2) complexity. If the points are close enough, the distance calculation can be simplified significantly. Another approach is to find clusters of points close to each other and calculate only within them. Divide your area into two-mile squares, put each point in the proper square, and check only between points in adjacent squares. It will still be N^2, but you will eliminate millions of tests – Gadi Feb 23 '22 at 13:17
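The grid idea from the comment above could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `count_within_grid`, the choice to size cells in degrees so that any two points within the radius always fall in the same or adjacent cells, and the simplification that the data stays away from the poles and the ±180° meridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_within_grid(latitudes, longitudes, radius_miles):
    """Count other points within radius_miles, comparing only adjacent cells."""
    lats, lons = list(latitudes), list(longitudes)
    if not lats:
        return []

    half_angle = radius_miles / (2 * EARTH_RADIUS_MILES)
    # Cell height: the largest latitude gap two points within the radius can have.
    cell_h = math.degrees(2 * half_angle)
    # Cell width: the largest longitude gap, taken at the most poleward point.
    min_cos = min(math.cos(math.radians(lat)) for lat in lats)
    cell_w = math.degrees(2 * math.asin(min(1.0, math.sin(half_angle) / min_cos)))

    # Bucket every point index by its (row, column) cell.
    grid = defaultdict(list)
    for idx, (lat, lon) in enumerate(zip(lats, lons)):
        grid[(math.floor(lat / cell_h), math.floor(lon / cell_w))].append(idx)

    counts = [0] * len(lats)
    for (row, col), members in grid.items():
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                for j in grid.get((row + d_row, col + d_col), ()):
                    for i in members:
                        if i != j and haversine_miles(
                            lats[i], lons[i], lats[j], lons[j]
                        ) <= radius_miles:
                            counts[i] += 1
    return counts


latitudes = [38.470628, 37.994155, 38.66937, 34.119578, 36.292307]
longitudes = [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074]
grid_counts_50_miles = count_within_grid(latitudes, longitudes, 50)
```

The exact distance check still runs for every candidate pair, so the grid only reduces how many pairs are tested; if all points fall into a few cells the cost is still quadratic.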
1
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

Sample Data

df = pd.DataFrame({'id':list('abcde'),'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074]})  #sample

Extract the latitude and longitude and convert them to radians. Express the search radius as an angle on the unit sphere by dividing the distance by the Earth's radius.

coords = df[["latitude","longitude"]]

distance_in_miles = 50
earth_radius_in_miles = 3958.8

radius = distance_in_miles / earth_radius_in_miles
tree = BallTree( np.radians(coords), leaf_size=10, metric='haversine')

tree.query_radius( np.radians(coords), r=radius, count_only=True)

This gives array([3, 2, 2, 1, 1]); each count includes the point itself.
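To get back to the original question (a 2-mile radius, counted per id), the same tree can be queried with a different radius and the result stored as a column. This is a small sketch along those lines; the helper `neighbours_within` and the column names are invented for the example, and subtracting 1 is my choice for excluding the point itself from its own count.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({
    'id': list('abcde'),
    'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],
    'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074],
})

earth_radius_in_miles = 3958.8

# The haversine metric expects [latitude, longitude] in radians.
coords_rad = np.radians(df[["latitude", "longitude"]].to_numpy())
tree = BallTree(coords_rad, leaf_size=10, metric='haversine')


def neighbours_within(distance_in_miles):
    """Number of OTHER points within the given distance of each point."""
    radius = distance_in_miles / earth_radius_in_miles
    counts = tree.query_radius(coords_rad, r=radius, count_only=True)
    return counts - 1  # drop the point itself


df["n_within_2_miles"] = neighbours_within(2)
df["n_within_50_miles"] = neighbours_within(50)
```

Building the tree once and reusing it for several radii avoids repeating the construction cost, which should matter for a frame with around a million rows.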


If you want the indices themselves, so that you can compute aggregates over each neighbourhood, one way is:

df = pd.DataFrame({'id':list('abcde'),'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074], 'saleprice_usd_per_sqf': [200, 300, 700, 350, 50]})
coords = df[["latitude","longitude"]]

distance_in_miles = 50
earth_radius_in_miles = 3958.8

radius = distance_in_miles / earth_radius_in_miles

Note that we request the indices here, not only the count:

tree = BallTree( np.radians(coords), leaf_size=10, metric='haversine')
indici = tree.query_radius( np.radians(coords), r=radius, count_only=False)

Then use a list comprehension to get, for instance, the median value within each radius. Be aware that the point itself is always included in its own radius.

[np.median(df.saleprice_usd_per_sqf.values[idx]) for idx in indici]
Willem Hendriks
  • Willem, thanks for such an elegant solution! Do I understand correctly that this script will allow me to calculate: how many dataset objects are in a given radius from the first, second, and so on object? (I specify, because I do not have the same number of them in the test and calculated versions) And I have another question: if there is such feature in the dataset as the saleprice of an object, how can I use the script you proposed to calculate the median cost of objects in a given radius for 1, 2, 3, .... n-object? – Dima Feb 25 '22 at 10:03
  • Yes, this calculates for each object the number of objects within the radius; if it is 1, it is only the object itself. I use radius = 50 miles here. So the first 3 indicates that, besides the object itself, 2 others are within a radius of 50 miles. – Willem Hendriks Feb 25 '22 at 10:17
  • For the question about median costs, add this to the data. A BallTree can return the indices for each point, which can be used to calculate this. In the above example "count_only=True", so it won't be possible there – Willem Hendriks Feb 25 '22 at 10:19
  • Ok, I understand, thanks a lot!! As example I mean: `df = pd.DataFrame({'id':list('abcde'),'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074], 'saleprice_usd_per_sqf': [200, 300, 700, 350, 50]})` – Dima Feb 25 '22 at 10:40
  • added to answer. Make sure to be as complete as possible in next question on stack. – Willem Hendriks Feb 25 '22 at 11:09
  • Willem, I am very grateful to you for your help!! – Dima Feb 25 '22 at 11:31
  • Willem, could you please help me with one more episode: let's say some flag is added in our set and it takes the form: df = pd.DataFrame({'id':list('abcde'),'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],'longitude': [-121.404586, -121.802341, -121.295325, -1717.41, -1717.41 , -119.804074], "flag": [1,1,0,0,0], 'saleprice_usd_per_sqf': [200, 300, 700, 350, 50]}) How could I take this flag into account when calculating the median? That is, for example, it is necessary to calculate the median of the saleprice of only those objects in which the flag value is 0 – Dima Feb 28 '22 at 11:22
  • This boils down to basic python/pandas usage. – Willem Hendriks Feb 28 '22 at 12:50
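For completeness, one possible way to take the flag into account is to filter each neighbourhood with a boolean mask before taking the median. The sketch below reuses the sample coordinates from the answer together with the `flag` values from the comment; the helper `median_where_flag_zero`, the output column name, and returning NaN when no neighbour has flag 0 are all choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({
    'id': list('abcde'),
    'latitude': [38.470628, 37.994155, 38.66937, 34.119578, 36.292307],
    'longitude': [-121.404586, -121.802341, -121.295325, -117.413791, -119.804074],
    'flag': [1, 1, 0, 0, 0],
    'saleprice_usd_per_sqf': [200, 300, 700, 350, 50],
})

distance_in_miles = 50
earth_radius_in_miles = 3958.8
radius = distance_in_miles / earth_radius_in_miles

coords_rad = np.radians(df[["latitude", "longitude"]].to_numpy())
tree = BallTree(coords_rad, leaf_size=10, metric='haversine')
indici = tree.query_radius(coords_rad, r=radius, count_only=False)

flag = df["flag"].to_numpy()
price = df["saleprice_usd_per_sqf"].to_numpy()


def median_where_flag_zero(idx):
    """Median price over the neighbours in idx whose flag is 0 (NaN if none)."""
    selected = price[idx][flag[idx] == 0]
    return np.median(selected) if selected.size else np.nan


df["median_price_flag0"] = [median_where_flag_zero(idx) for idx in indici]
```

Point `b` only has `a` and itself within 50 miles, both with flag 1, so its median comes out as NaN. An alternative would be to build the tree from the flag-0 rows only and query it with all points.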