Python find distance between geolocation in dataframe having same id

Question

I have a pandas dataframe with duplicate site ids, each with a longitude and latitude. I want to find the distance between the locations recorded for the same id, so that if they are more than 2 miles apart I can mark them as different locations.

id	Longitude	Latitude
1	35.624404	34.542616
2	35.637812	34.52873
3	35.433423	34.465716
1	35.439104	34.468755
2	35.512096	34.524426
3	35.512096	34.524426

I would first get the duplicates by id, then use the haversine distance to find the elements that are less than 2 miles apart, so you can discard them from the original df. - PeCaDe, Oct 17 '22 at 10:50
I got the duplicate ids only, but I can't loop over the rows with the same site id to calculate the distance. It is OK to have several different sites near each other; the question is about rows with the same site id being near each other. - Gojoe, Oct 17 '22 at 10:55
Actually, thanks, I got your point. This helped 100%: https://stackoverflow.com/questions/43577086/pandas-calculate-haversine-distance-within-each-group-of-rows - Gojoe, Oct 17 '22 at 11:22
What if a string of sites, each within 2 miles of the next, forms one long line? Are those one and the same? - Willem Hendriks, Oct 17 '22 at 14:30
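The haversine approach suggested in the comments can be sketched with pandas and NumPy alone. This is a minimal sketch, not the code from the linked question: `haversine_miles`, `miles_from_first`, `different_location` and the Earth radius of 3958.8 miles are illustrative assumptions. It reads the `Latitude` and `Longitude` columns exactly as they are labelled in the question and measures each row against the first row with the same id.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# sample data from the question
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 1, 2, 3],
        "Longitude": [35.624404, 35.637812, 35.433423, 35.439104, 35.512096, 35.512096],
        "Latitude": [34.542616, 34.52873, 34.465716, 34.468755, 34.524426, 34.524426],
    }
)

# coordinates of the first row of each id, aligned with every row of that id
first = df.groupby("id")[["Latitude", "Longitude"]].transform("first")

df["miles_from_first"] = haversine_miles(
    first["Latitude"], first["Longitude"], df["Latitude"], df["Longitude"]
)
# rows more than 2 miles from the first row with the same id
df["different_location"] = df["miles_from_first"] > 2
```

Because every row is compared only with the first row of its id, this does not settle how to treat a chain of points that are each within 2 miles of their neighbour; that would need a clustering step.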

score 1 · Accepted Answer · answered Oct 19 '22 at 10:37

This can be done fully in geopandas.

Use a UTM CRS so that distances are meaningful. Distances are then calculated in meters, hence the division by 1609.34 to convert to miles.
So that you know which point has been used as the reference (the first point in each group), the index of that point in the original data frame is captured.
This solution works whether 1, 2 or more points share the same id.

import pandas as pd
import geopandas as gpd

# sample data
df = pd.DataFrame(
    **{
        "columns": ["id", "Longitude", "Latitude"],
        "data": [
            [1, 35.624404, 34.542616],
            [2, 35.637812, 34.52873],
            [3, 35.433423, 34.465716],
            [1, 35.439104, 34.468755],
            [2, 35.512096, 34.524426],
            [3, 35.512096, 34.524426],
        ],
    }
)

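gdf = gpd.GeoDataFrame(
    df["id"],
    # points_from_xy expects (x, y), i.e. (longitude, latitude) for EPSG:4326.
    # The columns are passed in the order that produced the output shown;
    # swap them if the column labels in your data are accurate.
    geometry=gpd.points_from_xy(df["Latitude"], df["Longitude"]),
    crs="epsg:4326",
)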
gdf = gdf.to_crs(gdf.estimate_utm_crs())

# for each id, calculate distance in miles from first point
# for good measure capture index of point used to calc distance
gdf = (
    gdf.groupby("id")
    .apply(
        lambda d: d.assign(
            d=d["geometry"].distance(d["geometry"].iat[0]) / 1609.34,
            i=d.index.values[0],
        )
    )
    .to_crs("epsg:4326")
)

gdf

output

	id	geometry	d	i
0	1	POINT (34.542616 35.624404000000006)	0	0
1	2	POINT (34.52873 35.637812)	0	1
2	3	POINT (34.465716 35.433423)	0	2
3	1	POINT (34.468755 35.439104)	13.4336	0
4	2	POINT (34.524426 35.512096)	8.66908	1
5	3	POINT (34.524426 35.512096)	6.35343	2
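The question's goal is to mark ids whose points are more than 2 miles apart, which is a simple threshold on the `d` column. A minimal sketch, assuming the distances from the output above are copied into a plain DataFrame; the names `result`, `THRESHOLD_MILES`, `different_location` and `split_ids` are illustrative and not part of the answer:

```python
import pandas as pd

# distances in miles from the first point of each id, copied from the output
result = pd.DataFrame(
    {
        "id": [1, 2, 3, 1, 2, 3],
        "d": [0, 0, 0, 13.4336, 8.66908, 6.35343],
    }
)

THRESHOLD_MILES = 2

# True where a row is further than the threshold from its reference point
result["different_location"] = result["d"] > THRESHOLD_MILES

# ids that have at least one row beyond the threshold
split_ids = result.loc[result["different_location"], "id"].unique().tolist()
```

With the sample data every id ends up in `split_ids`, since each second occurrence is more than 2 miles from the first.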
