This answer presents two options:
Option 1, using a function that I created in my answer here. That answer also covers additional methods one could use.
Option 2, using a different function.
For testing purposes, even though I recommend testing with data as close as possible to what one will actually be using, I will take the example proposed by @Qdr:
import pandas as pd
import numpy as np
import random as rn
data = [[rn.randint(1, 10), rn.randint(1, 10)] for x in range(9)]
users = ['user1', 'user2', 'user3'] * 3
rn.shuffle(users)
df1 = pd.DataFrame(data, columns=['x', 'y'], index=users)
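Because the data is random, every run produces different points and therefore different distances. A small variation, assuming one wants repeatable results while testing, is to seed the generator first (the seed value 42 is arbitrary and not part of the original example):

```python
import random as rn

import pandas as pd

rn.seed(42)  # arbitrary seed, so repeated runs build the same dataframe
data = [[rn.randint(1, 10), rn.randint(1, 10)] for _ in range(9)]
users = ['user1', 'user2', 'user3'] * 3
rn.shuffle(users)
df1 = pd.DataFrame(data, columns=['x', 'y'], index=users)
print(df1)
```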
Option 1
To measure the distance between two points given as geographic coordinates, one can use, as mentioned above, one of the functions I shared here, where one will find a more detailed explanation.
The function is called haversine, and is inspired by the haversine formula.
from math import radians, sin, cos, atan2, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance (in km) between two points
    using their longitude and latitude (in degrees).
    """
    # Radius of the Earth (in km)
    r = 6371.0
    # Convert degrees to radians
    # First point
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    # Second point
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return r * c
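As a quick sanity check (a minimal sketch, with the function repeated together with its math imports so the snippet runs on its own), one degree of longitude along the equator should come out at roughly 111 km when r = 6371.0, and a point compared with itself should give 0:

```python
from math import radians, sin, cos, atan2, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance (in km) between two points given in degrees."""
    r = 6371.0  # radius of the Earth in km, as used above
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return r * c

same_point = haversine(0.0, 0.0, 0.0, 0.0)   # identical points
one_degree = haversine(0.0, 0.0, 1.0, 0.0)   # 1 degree of longitude on the equator
print(same_point)            # 0.0
print(round(one_degree, 1))  # 111.2
```

Note the argument order: longitude comes first, then latitude.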
As one wants the max distance, let's create a function that uses the previous one:
def max_distance(lat1, lon1, lat2, lon2):
    # Calculate the distance between two points (haversine takes longitude first)
    distance = haversine(lon1, lat1, lon2, lat2)
    # Return the max distance (for a single pair this is the distance itself)
    return np.max(distance)
Finally, one can create a new dataframe, df2:
[In]: df2 = df1.groupby(df1.index).apply(lambda x: pd.Series({'max_distance': max_distance(x['x'].iloc[0], x['y'].iloc[0], x['x'].iloc[1], x['y'].iloc[1])}))
[Out]: max_distance
user1 866.714728
user2 867.428750
user3 247.358878
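Note that the call above passes only the first two rows of each group (iloc[0] and iloc[1]), so the result is the distance between those two points. If the maximum over every pair of a user's points is wanted instead, a sketch along the following lines could be used. It assumes, as above, that x holds latitudes and y holds longitudes in degrees; the helper name max_pairwise_haversine and the small fixed dataframe are hypothetical, chosen only for illustration:

```python
from itertools import combinations
from math import radians, sin, cos, atan2, sqrt

import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance (in km) between two points given in degrees."""
    r = 6371.0
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return r * 2 * atan2(sqrt(a), sqrt(1 - a))

def max_pairwise_haversine(group):
    """Largest haversine distance (km) over all pairs of rows in a group."""
    # Assumption: column x is the latitude and column y is the longitude
    points = list(zip(group['x'], group['y']))
    if len(points) < 2:
        return 0.0  # a single point has no pair to compare with
    return max(
        haversine(lon1, lat1, lon2, lat2)
        for (lat1, lon1), (lat2, lon2) in combinations(points, 2)
    )

# Hypothetical fixed data, so the result does not depend on random values
df1 = pd.DataFrame(
    {'x': [0.0, 0.0, 0.0, 1.0, 1.0], 'y': [0.0, 1.0, 3.0, 1.0, 2.0]},
    index=['user1', 'user1', 'user1', 'user2', 'user2'],
)
df2 = df1.groupby(df1.index).apply(max_pairwise_haversine).to_frame('max_distance')
print(df2)
```

For user1 the farthest pair is the one that is 3 degrees of longitude apart on the equator, so the first two rows alone would not have given the maximum.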
Option 2
Depending on one's requirements, and assuming a straight-line (Euclidean) distance between two points is enough, the following function also does the work:
def max_distance(lat1, lon1, lat2, lon2):
    # Calculate distance between two points
    distance = np.sqrt((lat1 - lat2)**2 + (lon1 - lon2)**2)
    # Return max distance
    return np.max(distance)
In order to create the new dataframe, grouped by users (in this example, the index of the dataframe df1), with a column named max_distance that will hold the max distance between two points for a given user (using the previous function), the following should do the work:
df2 = df1.groupby(df1.index).apply(lambda x: pd.Series({'max_distance': max_distance(x['x'].iloc[0], x['y'].iloc[0], x['x'].iloc[1], x['y'].iloc[1])}))
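As with Option 1, the line above compares only the first two rows of each group. A minimal sketch of a variant that takes the maximum Euclidean distance over all pairs of a user's points, using NumPy broadcasting, might look as follows. The helper name max_pairwise_euclidean and the fixed sample data are hypothetical, used only to make the result easy to check by hand:

```python
import numpy as np
import pandas as pd

def max_pairwise_euclidean(group):
    """Largest Euclidean distance over all pairs of (x, y) rows in a group."""
    pts = group[['x', 'y']].to_numpy(dtype=float)
    # Differences between every pair of points, shape (n, n, 2)
    diffs = pts[:, None, :] - pts[None, :, :]
    # Pairwise distance matrix, shape (n, n); the diagonal is zero
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.max()

# Hypothetical fixed data: user1 has three points, user2 has two
df1 = pd.DataFrame(
    {'x': [0, 3, 1, 2, 2], 'y': [0, 4, 1, 2, 5]},
    index=['user1', 'user1', 'user1', 'user2', 'user2'],
)
df2 = df1.groupby(df1.index).apply(max_pairwise_euclidean).to_frame('max_distance')
print(df2)
# user1 -> 5.0 (between (0, 0) and (3, 4)), user2 -> 3.0
```

The distance matrix grows quadratically with the number of points per user, so for very large groups a different approach may be preferable.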