Python - Compare dataframe rows with each other

Question

I have a dataframe with three columns: id, latitude and longitude. For each row, I need to find the other rows that lie within some fixed distance thresholds.

The solution I'm using is a double for loop, and I'm looking for more efficient implementations.

Here's my current code:

```python
import pandas as pd

def distance(coord1, coord2):
    ...
    return float_distance_in_km

df = pd.read_csv("coordinates.csv", na_values=None)

# One list of neighbor ids per row, for each distance band
lessThan1 = list()
lessThan5 = list()
lessThan10 = list()
lessThan50 = list()
for i in range(0, len(df)):
    lessThan1_row = list()
    lessThan5_row = list()
    lessThan10_row = list()
    lessThan50_row = list()
    # read_csv stores missing values as NaN, not None, so test with pd.notna
    if pd.notna(df['longitude'][i]) and pd.notna(df['latitude'][i]):
        coords_1 = (df['longitude'][i], df['latitude'][i])
        for j in range(0, len(df)):
            if i == j:
                continue
            if pd.isna(df['longitude'][j]) or pd.isna(df['latitude'][j]):
                continue
            coords_2 = (df['longitude'][j], df['latitude'][j])
            dist = distance(coords_1, coords_2)
            neighbor = df['id'][j]
            # The elif chain puts each neighbor in exactly one band
            if dist < 1:
                lessThan1_row.append(neighbor)
            elif dist < 5:
                lessThan5_row.append(neighbor)
            elif dist < 10:
                lessThan10_row.append(neighbor)
            elif dist < 50:
                lessThan50_row.append(neighbor)
    lessThan1.append(lessThan1_row)
    lessThan5.append(lessThan5_row)
    lessThan10.append(lessThan10_row)
    lessThan50.append(lessThan50_row)
df["1km"] = lessThan1
df["5km"] = lessThan5
df["10km"] = lessThan10
df["50km"] = lessThan50
```

The dataframe output is not mandatory; I just happen to have the dataset loaded as a dataframe.

"More efficient" is not clear. Please provide specific metrics for what counts as "more efficient". Also, this might be a question for the Code Review stack, not SO. — Itération 122442, Aug 22 '23 at 13:42
The simplest way is to use a *spatial join* in `geopandas`. You simply need to create point geometries from the coordinates. — Herbert, Aug 22 '23 at 13:43
@Itération122442 Obviously either a vectorized implementation or something with a decent library rather than for-loops is preferred in this case, efficiency-wise. — Herbert, Aug 22 '23 at 13:46
Also, if you keep doing the comparisons this way, I suggest building one list for <1, one for <5, etc. via that if/else chain, with only one list append in each case. Then at the end you can add the <1 list to all the other lists, the <5 list to all the lists for greater distances, and so forth. — UpAndAdam, Aug 22 '23 at 16:41
Why are you always appending to `lessThan1_row` regardless of the condition at the `if dist<1` area? Second, your problem statement doesn't make sense in one regard: you have a table of `Points`, so wouldn't you want a list of pairs of `Points` with distances less than certain values? Then, instead of iterating over the list x^2 times, iterate x(x-1)/2 times by running the outer loop from 0 to max and the inner loop from i to max; there is no need to compare a pair twice. The distance from `Point A` to `Point X` is the same as from `Point X` to `Point A`, and it's the same pair of `Points`. — UpAndAdam, Aug 22 '23 at 16:42
Does this answer your question? [KDTree for longitude/latitude](https://stackoverflow.com/questions/10549402/kdtree-for-longitude-latitude) — Vitalizzare, Aug 22 '23 at 18:36
Not a problem. Hope my idea was mildly helpful. It's not technically an order of magnitude faster, as it's still x^2, but it's still a mild improvement, except I get the sense you need the full listing for data consistency. — UpAndAdam, Aug 23 '23 at 13:52
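
The vectorized approach hinted at in the comments could look roughly like the NumPy sketch below. It rests on several assumptions: `haversine_matrix` and `neighbors_by_band` are hypothetical names, the haversine formula with a 6371 km Earth radius stands in for the question's elided `distance` function, and the bands are exclusive (a neighbor at 3 km lands only in the `5km` column), as in the question's `elif` chain. It builds the full n x n matrix, so memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_matrix(lat_deg, lon_deg):
    """Return the n x n matrix of great-circle distances in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row gives all pairwise differences
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def neighbors_by_band(df, edges=(1, 5, 10, 50)):
    """Add one column per band holding the list of neighbor ids for each row."""
    # Rows with missing coordinates take no part in the distance computation
    valid = df.dropna(subset=["latitude", "longitude"])
    dist = haversine_matrix(valid["latitude"], valid["longitude"])
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbor
    ids = valid["id"].to_numpy()

    out = df.copy()
    lower = 0.0
    for upper in edges:
        mask = (dist >= lower) & (dist < upper)  # exclusive band [lower, upper)
        lookup = {idx: ids[row].tolist() for idx, row in zip(valid.index, mask)}
        # Rows without coordinates get an empty list, as in the question's loop
        lists = [lookup.get(idx, []) for idx in df.index]
        out[f"{upper}km"] = pd.Series(lists, index=df.index, dtype=object)
        lower = upper
    return out
```

Calling `neighbors_by_band(df)` on a frame with `id`, `latitude` and `longitude` columns returns a copy with `1km`, `5km`, `10km` and `50km` columns. The matrix is symmetric, so half of the work is redundant, as pointed out above; the sketch keeps the full matrix only for simplicity.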

score 1 · Answer 1 · answered Aug 22 '23 at 14:04

One way to do it is to use geopandas like this:

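```python
import geopandas as gpd
import pandas as pd
import numpy as np

# 200 random points scattered around latitude 50, longitude 6
data = pd.DataFrame(np.random.randn(200, 2) * 2 + [[50, 6]], columns=['latitude', 'longitude'])
# Keep only the index and build point geometries (x = longitude, y = latitude)
data = gpd.GeoDataFrame(
    data[[]], geometry=gpd.points_from_xy(data.longitude, data.latitude), crs="EPSG:4326")

# Distance from every point to every other point, reshaped to long format
distances = (
    data['geometry']
    .apply(lambda x: data.distance(x))
    .melt(ignore_index=False)
    .reset_index()
    .rename({'index': 'from', 'variable': 'to', 'value': 'distance'}, axis=1)
)
distances[distances['distance'] < 1]
```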

The result holds the index values in the `from` and `to` columns for every pair whose distance is smaller than 1 meter. The unit of meters isn't stated in the geopandas documentation, though.
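
As a sanity check on the unit, the sketch below compares a plain Euclidean distance on the raw degree values with a great-circle (haversine) distance in kilometers, using only the standard library and no geopandas. The two test points are the ones quoted in the comments; the function names and the 6371 km Earth radius are assumptions of this sketch. If the planar figure matches what `distance` returns on an EPSG:4326 frame, that result is in degrees rather than meters.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def planar_degrees(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw latitude/longitude values, in degrees."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


point_a = (36.7297026, -76.582792)
point_b = (37.288597, -121.930556)
print(planar_degrees(*point_a, *point_b))  # about 45.35, a value in degrees
print(haversine_km(*point_a, *point_b))    # roughly 3,987 km
```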

answered Aug 22 '23 at 14:04

Herbert


This is very interesting and might be the answer I'm looking for. The unit is definitely not meters: for example, it says that the distance between (36.7297026,-76.582792) and (37.288597,-121.930556) is 45.351208, but it should be 3,987 km – AlbertoD Aug 22 '23 at 20:59
@AlbertoD I couldn't find the unit in the documentation. Maybe it's just the norm, in which case this is useless. – Herbert Aug 23 '23 at 11:02
@AlbertoD Also, please note the quadratic complexity in memory. If this is an issue, filter in the `lambda` beforehand, e.g. by the largest radius. – Herbert Aug 23 '23 at 11:03
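
One way to keep memory bounded, in the spirit of filtering by the largest radius, is to process the rows in blocks and keep only the pairs that fall within that radius. The sketch below uses NumPy and the haversine formula instead of geopandas; `pairs_within`, the `chunk` parameter and the 6371 km Earth radius are assumptions for illustration. The running time is still quadratic, but only one block of `chunk` x n distances is held in memory at a time.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def pairs_within(lat_deg, lon_deg, max_km=50.0, chunk=1000):
    """Yield (i, j, km) for every ordered pair i != j closer than max_km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    n = len(lat)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Distances from this block of rows to all rows: shape (stop - start, n)
        dlat = lat[start:stop, None] - lat[None, :]
        dlon = lon[start:stop, None] - lon[None, :]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :])
             * np.sin(dlon / 2) ** 2)
        km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        # Keep only pairs within the largest radius; NaN rows never match
        rows, cols = np.nonzero(km < max_km)
        for r, c in zip(rows, cols):
            i, j = int(start + r), int(c)
            if i != j:
                yield i, j, float(km[r, c])
```

The yielded pairs can then be sorted into the 1, 5, 10 and 50 km bands in a second, much cheaper pass.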
