I have two dataframes of coordinates, archivo1 and archivo2. For each point in the first dataframe I need the id of the closest point in the second one. My code so far is:

import math
import pandas as pd
import numpy as np

def deg2rad(deg):
    return deg * (math.pi / 180)

def getDistance(df, lat, log):
    # haversine distance (km) from the point (lat, log) to every row of df
    R = 6371  # Earth radius in km
    df['dLat'] = (lat - df['lat']).apply(deg2rad) / 2
    df['dLon'] = (log - df['log']).apply(deg2rad) / 2
    df['a'] = (df['dLat'].apply(math.sin) * df['dLat'].apply(math.sin)
               + df['lat'].apply(deg2rad).apply(math.cos) * math.cos(deg2rad(lat))
               * df['dLon'].apply(math.sin) * df['dLon'].apply(math.sin))
    df['b'] = df['a'].apply(math.sqrt) / (1 - df['a']).apply(math.sqrt)
    df['Distancia'] = R * 2 * df['b'].apply(math.atan)

# for each point in archivo1, find the id of the closest point in archivo2
for i in range(len(archivo1)):
    getDistance(archivo2, archivo1['lat'].iloc[i], archivo1['log'].iloc[i])
    archivo1.loc[archivo1.index[i], 'id'] = str(
        archivo2[archivo2['Distancia'] == archivo2['Distancia'].min()]['id'].iloc[0])

The code runs and gives me the expected results; however, the first file has 7 million rows and the second one 70k, so it takes 7 days to run. Could anyone help me optimize it?

Here is a sample of the two files. File 2 is the one to look up:

File 2:
id longitude latitude                  
L10F10P1    -72.61521393    8.290479554
L10F10P10   -72.61517542    8.290583772
L10F10P100  -72.61481425    8.290812192
L10F10P101  -72.61484522    8.290877898
L10F10P102  -72.61488579    8.290968212
L10F10P103  -72.61492075    8.291033898
L10F10P104  -72.61495586    8.291095669
L10F10P105  -72.61499304    8.291166076
L10F10P106  -72.61503357    8.291235121
L10F10P107  -72.61508271    8.291330912
L10F10P108  -72.61516194    8.291456605
L10F10P109  -72.61519939    8.291548893
L10F10P11   -72.61522969    8.290676982
L10F10P110  -72.61522794    8.291592503
[76701 rows x 9 columns]
File 1:
latitude longitude 
8.318648471 -72.6132329
8.318648678 -72.6134567
8.318648971 -72.6133456
8.318678421 -72.6138765
8.319765345 -72.6137658
[6877229 rows x 10 columns]

1 Answer

Without a working example I will not write exact code, but I will suggest improvements line by line. The general idea is that apply is quite slow, as it is essentially a loop behind the scenes.


This is certainly slow:

df['dLat'] = (lat - df['lat']).apply(deg2rad) / 2

This will be better:

df['dLat'] = (lat - df['lat']) * math.pi / 180 / 2

Using numpy functions rather than applying math functions should also be faster:

np.sin(df['dLat'].values)

instead of

df['dLat'].apply(math.sin)

The values property gives you access to the underlying numpy array. Similarly, use np.sqrt.
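
For example, the square-root step could become (just a sketch, mirroring the corrected b term in the question):

df['b'] = np.sqrt(df['a'].values) / np.sqrt(1 - df['a'].values)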


Then use np.multiply repeatedly on the numpy arrays calculated above to multiply them element-wise. You can assign the final array back to the dataframe's column Distancia.
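
Putting those pieces together, a vectorized version of getDistance might look like this (a sketch; get_distance_vectorized is a hypothetical name, and the element-wise * on arrays is exactly np.multiply):

import numpy as np

def get_distance_vectorized(df, lat, log):
    # all steps below work on whole numpy arrays at once, with no .apply
    R = 6371  # Earth radius in km
    dlat = np.radians(lat - df['lat'].values) / 2
    dlon = np.radians(log - df['log'].values) / 2
    a = (np.sin(dlat) * np.sin(dlat)
         + np.cos(np.radians(df['lat'].values)) * np.cos(np.radians(lat))
         * np.sin(dlon) * np.sin(dlon))
    # 2 * atan2(sqrt(a), sqrt(1 - a)) is the standard haversine form
    df['Distancia'] = R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))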


The for loop could be improved by defining a function containing the two lines inside the loop, and using apply to apply it to each row in the dataframe.
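
A sketch of that, assuming the hypothetical get_distance_vectorized helper above and 'lat'/'log' columns in archivo1:

def closest_id(row):
    # distances from this archivo1 point to every archivo2 point
    get_distance_vectorized(archivo2, row['lat'], row['log'])
    return str(archivo2[archivo2['Distancia'] == archivo2['Distancia'].min()]['id'].iloc[0])

archivo1['id'] = archivo1.apply(closest_id, axis=1)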


Finally, using argmin or idxmin should be faster than:

archivo2[archivo2['Distancia'] == archivo2['Distancia'].min()]
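
For example, something like:

archivo2.loc[archivo2['Distancia'].idxmin(), 'id']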

By putting all of the above together, you should see quite an improvement already!

IanS
  • On a 100-record sample it reduces the time from 47 seconds to 21. Thanks a lot – Abel Buriticá Aug 26 '16 at 20:02
  • If you're still looking for improvements you should ask a new question (with the new code). It should receive more attention than on a Friday afternoon, and maybe someone will have an entirely new idea, as opposed to the gradual improvements I suggested. In the meantime you can accept my answer; you will gain 2 reputation ;) – IanS Aug 29 '16 at 08:17