How to generate a new dataframe that is the result of a row by row function on an existing df efficiently

Question

import haversine as hs
from haversine import Unit
import pandas as pd
        
df=pd.DataFrame(data={'id':[1,2,3,4,5],'Latitude':[33.110348,33.500308,30.428149,33.493309,36.433678],'Longitude':[-83.259740,-86.792691,-92.326981,-86.828201,-89.025341]})
            
def getDists(x,y):
   return df.apply(lambda row: hs.haversine([x,y],row[['Latitude','Longitude']],unit=Unit.MILES),axis=1)
            
df.apply(lambda row: getDists(row['Latitude'],row['Longitude']),axis=1)

I have a Dataframe with lat/long values and a unique identifier. I would like to create a new dataframe that would have the distances (using haversine) of every combination of values. I also realized that the distance between x and y is the same as y and x so as long as I sort the values I can eliminate half the calculations.

I can easily do this with a for loop/iterrows, but I was trying to do it as efficiently as possible.

I think I can accomplish this using .apply() and a lambda function to my haversine function, but after thinking about this for hours I've managed only to completely confuse myself.

Could you post your input dataframe? – Emma Oct 05 '21 at 18:55 — Emma, Oct 05 '21 at 18:55

score 1 · Answer 1 · answered Oct 05 '21 at 17:52

1

Are you looking for something along the lines of

df['distance'] = df.apply(lambda r: haversine(r['lat'],r['long']),axis =1)

I've thought about what you're working on before, and I believe my original solution was to use a distance matrix which looped over 1/2 of the matrix---including the diagonal, and placing each value into the corresponding duplicated positions.

answered Oct 05 '21 at 17:52

YoungTim

173
5

Yes, that's what I'm currently up to with the apply/lambda: def getDists(x,y): return df.apply(lambda row: hs.haversine([x,y],row[['Latitude','Longitude']],unit=Unit.MILES),axis=1) df.apply(lambda row: getDists(row['Latitude'],row['Longitude']),axis=1) – Joseph Chillemi Oct 05 '21 at 19:02

Emma · Accepted Answer · 2021-10-05T20:06:00.143

I would first create those non-duplicated exhaustive combinations in Dataframe.

I don't know the terminology, but it is sortof cartesian product minus the (combination of itself and the duplicated combination of 2 locations).

In matrix sense, what you need is upper triangle matrix.

>>> import numpy as np
>>> np.triu(1- np.eye(len(df)))

array([[0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.]])

Using this matrix as Dataframe index, I get Dataframe with the exhaustive list of 2 locations combinations.

>>> i, j = np.where(np.triu(1- np.eye(len(df))))
>>> df = (df.iloc[i].reset_index(drop=True)
          .join(df.iloc[j].reset_index(drop=True), lsuffix='_x', rsuffix='_y'))
>>> df
   id_x  Latitude_x  Longitude_x  id_y  Latitude_y  Longitude_y
0     1   33.110348   -83.259740     2   33.500308   -86.792691
1     1   33.110348   -83.259740     3   30.428149   -92.326981
2     1   33.110348   -83.259740     4   33.493309   -86.828201
3     1   33.110348   -83.259740     5   36.433678   -89.025341
4     2   33.500308   -86.792691     3   30.428149   -92.326981
5     2   33.500308   -86.792691     4   33.493309   -86.828201
6     2   33.500308   -86.792691     5   36.433678   -89.025341
7     3   30.428149   -92.326981     4   33.493309   -86.828201
8     3   30.428149   -92.326981     5   36.433678   -89.025341
9     4   33.493309   -86.828201     5   36.433678   -89.025341

This doesn't include any duplicates of combination, so then, I can calculate the distance for each row.

>>> df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)

>>> df
   id_x  Latitude_x  Longitude_x  id_y  Latitude_y  Longitude_y      result
0     1   33.110348   -83.259740     2   33.500308   -86.792691  331.157711
1     1   33.110348   -83.259740     3   30.428149   -92.326981  907.184077
2     1   33.110348   -83.259740     4   33.493309   -86.828201  334.342337
3     1   33.110348   -83.259740     5   36.433678   -89.025341  643.134695
4     2   33.500308   -86.792691     3   30.428149   -92.326981  623.748640
5     2   33.500308   -86.792691     4   33.493309   -86.828201    3.383468
6     2   33.500308   -86.792691     5   36.433678   -89.025341  384.390670
7     3   30.428149   -92.326981     4   33.493309   -86.828201  620.539431
8     3   30.428149   -92.326981     5   36.433678   -89.025341  734.575638
9     4   33.493309   -86.828201     5   36.433678   -89.025341  383.356886

If your Dataframe is big, apply is not the best performant method, so if you worry about the performance of the apply, you can try np.vectorize.

%%timeit
df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)
1.02 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df['result']  = np.vectorize(hs.haversine)(df[['Latitude_x', 'Longitude_x']], df[['Latitude_y', 'Longitude_y']])
430 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

"I would first create those exhaustive combination in Dataframe". When you say this, do you mean I should just run all the combinations? If so, that's what I'm trying to avoid. However, I will definitely look at np.vectorize. If the speed difference is that significant, then this might just be good enough for my use. — Joseph Chillemi, Oct 05 '21 at 19:54
**non-duplicated** exhaustive combination, so it is not running for full list. Please check the df I generated, it only has 9 combinations. Also, I didn't have haversine framework, so I used https://stackoverflow.com/a/4913653/2956135 to calculate the result. You can easily swap the function to the framework's function. — Emma, Oct 05 '21 at 20:12
Oh, my apologies. I need to review your answer further. Thank you. — Joseph Chillemi, Oct 05 '21 at 20:23
Clever solution, thanks. I am going to use your answer with haversine_vector, which Hristo Vrigazov found. — Joseph Chillemi, Oct 06 '21 at 13:03
yup. with `comb=False`. `df['result'] = haversine_vector(df[['Latitude_x', 'Longitude_x']], df[['Latitude_y', 'Longitude_y']], comb=False)` — Emma, Oct 06 '21 at 14:20
@JosephChillemi If you are going to use the answer with vectorization, maybe change the selected answer, so that when users arrive at your question via search, the best answer will be displayed first. — Hristo Vrigazov, Oct 08 '21 at 20:24

score 1 · Answer 3 · answered Oct 05 '21 at 19:58

To achieve the best performance, the distance function that you are computing must be vectorized. It looks like haversine already supports vectorization (see the comb argument), so you should be able to do something like:

from haversine import haversine_vector, Unit
haversine_vector(df, df, Unit.MILES, comb=True)

This is based on reading the haversine documentation, the section for performance improvements.

How to generate a new dataframe that is the result of a row by row function on an existing df efficiently

3 Answers3