I am trying to optimize a function in order to reduce its computation time. More specifically, for each point of plot1 I want to keep the minimum haversine distance to all points of plot2 (plot1 and plot2 are dataframes with latitude and longitude columns). Here is my code:

def calculate_min_haversine_distance(plot1, plot2):

    for index,row in plot1.iterrows():
        minimum = 100000000
        for index2, row2 in plot2.iterrows():
            dis = haversine_distance(row.latitude, row.longitude, row2.latitude, row2.longitude) 
            if (dis<minimum):
                minimum=dis
        plot1.loc[index, 'Min Haversine Distance'] = minimum

    return plot1
ann
  • Take a look at https://stackoverflow.com/questions/1233448/no-multiline-lambda-in-python-why-not – Yeheshuah Nov 12 '19 at 08:32
  • "I am trying to convert a function to a lambda expression, in order to minimize the computational time. " Hold on. Why do you think converting this to a lambda expression will improve the runtime? There is nothing special about lambda functions *except for the fact that they are anonymous*, otherwise, they are just like any other function. – juanpa.arrivillaga Nov 12 '19 at 08:43
  • @juanpa.arrivillaga *lambda expressions* do have runtime improvements in comparison to regular loops, even if they are not the best option when vectorization or native functions are possible. – Aryerez Nov 12 '19 at 08:52
  • @Aryerez no, they absolutely do not. A lambda is **just a normal function** that is **anonymous**. – juanpa.arrivillaga Nov 12 '19 at 08:53
  • @juanpa.arrivillaga I am talking about using a *lambda* in `pandas apply` instead of doing a loop. – Aryerez Nov 12 '19 at 08:56
  • @juanpa.arrivillaga maybe you're right, I didn't know. I have to paraphrase my question. – ann Nov 12 '19 at 09:19
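
To illustrate the point made in the comments above, here is a minimal sketch showing that a lambda and a `def` function are the same kind of object. The names `double_def` and `double_lambda` are made up for this illustration.

```python
# A named function and a lambda with the same body.
def double_def(x):
    return x * 2

double_lambda = lambda x: x * 2

# Both are plain function objects of the same type.
same_type = type(double_def) is type(double_lambda)

# The visible difference is the name: the lambda is anonymous.
names = (double_def.__name__, double_lambda.__name__)

print(same_type)  # True
print(names)      # ('double_def', '<lambda>')
```

Any speed-up seen with `apply` plus a lambda therefore comes from how pandas drives the iteration, not from the lambda itself.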

2 Answers

I'm not sure how to get rid of the first loop, but this should help you get rid of the second:

def calculate_min_haversine_distance(plot1, plot2):
    for index,row in plot1.iterrows():
        plot2['dist'] = plot2.apply(lambda x: haversine_distance(row.latitude, row.longitude, x.latitude, x.longitude), axis=1)
        plot1.loc[index,'Min Haversine Distance'] = min(plot2['dist'])
    plot2.drop('dist', axis=1, inplace=True) # Delete the temporary column created
    return plot1
Aryerez
  • thank you for your answer. Indeed this is faster than my function but I need to minimize much more computational time. Any idea is welcome! – ann Nov 12 '19 at 09:13
  • @theo Check NullByte's answer, together with my second comment on it (if he hasn't edited it yet). It may be faster. – Aryerez Nov 12 '19 at 09:18
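
One way to remove the first loop as well is NumPy broadcasting: compute the full distance matrix in one shot and take the row-wise minimum. The following is only a sketch, not the answerer's method. The function name `min_haversine_distances` and the constant `EARTH_RADIUS_KM` are assumptions introduced here, and it assumes coordinates in degrees and that a matrix of len(plot1) × len(plot2) floats fits in memory.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometres


def min_haversine_distances(lat1, lon1, lat2, lon2):
    """For each point of set 1, return the minimum haversine distance (km)
    to any point of set 2. Inputs are 1-D sequences of degrees."""
    # Column vectors for set 1, row vectors for set 2, so that
    # arithmetic broadcasts to a (len(set1), len(set2)) matrix.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # Clip guards against tiny rounding errors pushing a outside [0, 1].
    dist = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    # Minimum over set 2 for every point of set 1.
    return dist.min(axis=1)
```

With the dataframes from the question, this could be used as `plot1['Min Haversine Distance'] = min_haversine_distances(plot1['latitude'], plot1['longitude'], plot2['latitude'], plot2['longitude'])`.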

I would try to do something like this: I hope it helps.

import pandas as pd
import numpy as np


df1 = pd.DataFrame(data={'lat': [1,2,3,4], 'lon': [5,6,7,8]})
df2 = pd.DataFrame(data={'lat': [9,10,11,12], 'lon': [13,14,15,16]})
df1['key'], df2['key'] = 1,1

df_c = pd.merge(df1, df2, on='key').drop('key', axis=1)

# below function is copied from: https://stackoverflow.com/a/43577275/4450090
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

df_c['dist'] = df_c.apply(lambda x: haversine(x['lat_x'], x['lon_x'], x['lat_y'], x['lon_y']), axis=1)
min_val = 1000000
df_c['dist'] = df_c['dist'].apply(lambda x: x if x < min_val else min_val)
Dariusz Krynicki
  • You may be on the right track until the last two lines, but what he wants is to find, for each row in `plot1`, the minimum distance to all the rows in `plot2`. You are producing the whole distance matrix. – Aryerez Nov 12 '19 at 09:01
  • Use this instead of your last two lines: `df_c = df_c.groupby(['lat_x', 'lon_x']).min()['dist'].reset_index()` – Aryerez Nov 12 '19 at 09:17
  • @NullByte I don't understand exactly the reason but my pc crashes when I run this piece of code. I think it's happening when it merges the dataframes. Nevertheless, the dataframes aren't the same shape (so I think .merge won't work) and also I want to calculate the distance from each point of plot1 with all of the plot2 and take the minimum. – ann Nov 12 '19 at 10:09
  • It may be due to the dataframes shape. – Dariusz Krynicki Nov 12 '19 at 10:13
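
The crash described in the comments is plausibly a memory problem, since the cross join materializes len(plot1) × len(plot2) rows at once. A possible workaround, sketched here under that assumption, is to process the first set of points in chunks so that only a chunk-sized slice of the distance matrix exists at any time. The function name `chunked_min_haversine` and the `chunk_size` parameter are hypothetical, and coordinates are assumed to be in degrees.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometres


def chunked_min_haversine(lat1, lon1, lat2, lon2, chunk_size=1000):
    """Minimum haversine distance (km) from each point of set 1 to set 2,
    computed chunk by chunk to bound memory use."""
    lat1 = np.radians(np.asarray(lat1, dtype=float))
    lon1 = np.radians(np.asarray(lon1, dtype=float))
    # Set 2 is kept as row vectors and reused for every chunk.
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    cos_lat2 = np.cos(lat2)

    result = np.empty(lat1.shape[0])
    for start in range(0, lat1.shape[0], chunk_size):
        stop = start + chunk_size
        la = lat1[start:stop, None]  # chunk of set 1 as a column vector
        lo = lon1[start:stop, None]
        a = (np.sin((lat2 - la) / 2.0) ** 2
             + np.cos(la) * cos_lat2 * np.sin((lon2 - lo) / 2.0) ** 2)
        # Clip guards against rounding errors outside [0, 1].
        dist = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        result[start:stop] = dist.min(axis=1)
    return result
```

The two dataframes do not need the same number of rows for this to work. A smaller `chunk_size` lowers peak memory at the cost of more Python-level loop iterations.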