
I have this simple loop, which processes a big dataset:

for i in range (len(nbi.LONG_017)):
    StoredOCC = []
    for j in range (len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if (r < 0.03048):
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw) 
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)

len(nbi.LONG_017) is 3000 and len(Final.X) is 6 million data points.

I was wondering if there is a more efficient way to implement this code, or whether I could use parallel computing to make it faster?

I used the code provided here: Haversine Formula in Python (Bearing and Distance between two GPS points) for my haversine function:

from math import radians, sin, cos, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6372.8 # Radius of earth in kilometers. Use 3956 for miles
    return r * c
– Sina

2 Answers

Before talking about parallelization, you can work on optimizing your loop. The first step is to iterate over the data themselves instead of iterating over index values and then accessing the data on every iteration:

import numpy as np
import pandas as pd

#toy sample
np.random.seed(1)
size_nbi = 20
size_Final = 100

nbi = pd.DataFrame({'LONG_017':np.random.random(size=size_nbi)/100+73, 
                    'LAT_016':np.random.random(size=size_nbi)/100+73,})

Final = pd.DataFrame({'X':np.random.random(size=size_Final)/100+73,
                      'Y':np.random.random(size=size_Final)/100+73, 
                      'CUM_OCC':np.random.randint(size_Final,size=size_Final)})

Using your method, you get about 75 ms for dataframes of this size:

%%timeit
for i in range (len(nbi.LONG_017)):
    StoredOCC = []
    for j in range (len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if (r < 0.03048):
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw) 
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)
# 75.6 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now if you change the loops slightly, iterating over the data themselves, collecting the results in a list and assigning it as a column outside the loops, you get down to about 5 ms, so 15 times faster, just by optimizing the code (and you might find even better optimizations):

%%timeit
res_ADIT = []
for lon1, lat1 in zip (nbi.LONG_017.to_numpy(),
                       nbi.LAT_016.to_numpy()):
    StoredOCC = []
    for lon2, lat2, SWw in zip(Final.X.to_numpy(), 
                               Final.Y.to_numpy(), 
                               Final.CUM_OCC.to_numpy()):
        r = haversine(lon1, lat1, lon2, lat2)
        if (r < 0.03048):
            StoredOCC.append(SWw) 
    if len(StoredOCC) != 0:
        res_ADIT.append(np.max(StoredOCC))
    else:
        res_ADIT.append(np.nan)
nbi['ADIT_v2'] = res_ADIT
# 5.23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now you can go even further: use numpy to vectorize the second loop, and pass the values already converted to radians instead of mapping them on every iteration:

# empty list for result
res_ADIT = []

# work with array in radians
arr_lon2 = np.radians(Final.X.to_numpy())
arr_lat2 = np.radians(Final.Y.to_numpy())
arr_OCC = Final.CUM_OCC.to_numpy()

for lon1, lat1 in zip (np.radians(nbi.LONG_017), #pass directly the radians
                       np.radians(nbi.LAT_016)):
    
    # do all the subtraction in a vectorized way
    arr_dlon = arr_lon2 - lon1
    arr_dlat = arr_lat2 - lat1 

    # same here using numpy functions
    arr_dist = np.sin(arr_dlat/2)**2 + np.cos(lat1) * np.cos(arr_lat2) * np.sin(arr_dlon/2)**2
    arr_dist = 2 * np.arcsin(np.sqrt(arr_dist)) 
    arr_dist *= 6372.8

    # extract the values of CUM_OCC that meet the criteria
    r = arr_OCC[arr_dist<0.03048]
    
    # check that there is at least one element
    if r.size>0:
        res_ADIT.append(max(r))
    else :
        res_ADIT.append(np.nan)

nbi['AUDIT_np'] = res_ADIT

If you timeit this, you go down to 819 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each), so about 90 times faster than your original solution for this small dataframe size, and all the results are the same:

print(nbi)
     LONG_017    LAT_016  ADTT_02_2019  ADIT_v2  AUDIT_np
0   73.004170  73.008007           NaN      NaN       NaN
1   73.007203  73.009683          30.0     30.0      30.0
2   73.000001  73.003134          14.0     14.0      14.0
3   73.003023  73.006923          82.0     82.0      82.0
4   73.001468  73.008764           NaN      NaN       NaN
5   73.000923  73.008946           NaN      NaN       NaN
6   73.001863  73.000850           NaN      NaN       NaN
7   73.003456  73.000391           NaN      NaN       NaN
8   73.003968  73.001698          21.0     21.0      21.0
9   73.005388  73.008781           NaN      NaN       NaN
10  73.004192  73.000983          93.0     93.0      93.0
11  73.006852  73.004211           NaN      NaN       NaN

You can play a bit with the code and increase the size of each toy dataframe; you will see that as the size grows, especially the size of Final, the gain becomes interesting. For example, with size_nbi = 20 and size_Final = 1000, the vectorized solution is about 400 times faster, and so on. For your full data size (3K * 6M), you still need a bit of time: I estimate about 25 minutes on my computer, while your original solution would take hundreds of hours. If this is not enough, you can think about parallelization or also about using numba.
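To give an idea of the numba route mentioned above, here is a minimal sketch (not part of the original answer): it JIT-compiles the distance computation and parallelizes the outer loop with prange. The function name max_occ_within, the result column ADIT_numba and the 0.03048 km threshold argument are illustrative, and the first call also pays the JIT compilation time.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def max_occ_within(lon1, lat1, lon2, lat2, occ, threshold_km):
    # for each (lon1, lat1) point, return the max of occ over all
    # (lon2, lat2) points closer than threshold_km (haversine distance);
    # all coordinate arrays are expected in radians
    n = lon1.shape[0]
    out = np.full(n, np.nan)
    for i in prange(n):
        best = np.nan
        for j in range(lon2.shape[0]):
            dlon = lon2[j] - lon1[i]
            dlat = lat2[j] - lat1[i]
            a = np.sin(dlat / 2) ** 2 + np.cos(lat1[i]) * np.cos(lat2[j]) * np.sin(dlon / 2) ** 2
            dist = 2 * 6372.8 * np.arcsin(np.sqrt(a))
            if dist < threshold_km:
                if np.isnan(best) or occ[j] > best:
                    best = occ[j]
        out[i] = best
    return out

nbi['ADIT_numba'] = max_occ_within(
    np.radians(nbi.LONG_017.to_numpy()),
    np.radians(nbi.LAT_016.to_numpy()),
    np.radians(Final.X.to_numpy()),
    np.radians(Final.Y.to_numpy()),
    Final.CUM_OCC.to_numpy().astype(np.float64),
    0.03048,
)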

– Ben.T

This method, using a BallTree, takes about a minute on my machine (depending on the radius: the smaller it is, the faster), with the sizes you mention (7000 and 6M).

import numpy as np
import sklearn
import pandas as pd

I generate data (thanks Ben, I used your code):

#toy sample
np.random.seed(1)
size_nbi = 7000
size_Final = 6000000

nbi = pd.DataFrame({'LONG_017':np.random.random(size=size_nbi)/10+73, 
                    'LAT_016':np.random.random(size=size_nbi)/10+73,})

Final = pd.DataFrame({'X':np.random.random(size=size_Final)/10+73,
                      'Y':np.random.random(size=size_Final)/10+73, 
                      'CUM_OCC':np.random.randint(size_Final,size=size_Final)})

nbi_gps = nbi[["LAT_016", "LONG_017"]].values
final_gps = Final[["Y", "X"]].values

Create a BallTree:

%%time

from sklearn.neighbors import BallTree
import numpy as np

nbi =  np.radians(nbi_gps)
final = np.radians(final_gps)

tree = BallTree(final, leaf_size=12, metric='haversine')

Which took Wall time: 23.8 s

Query with

%%time

radius = 0.0000003

StoredOCC_indici = tree.query_radius(nbi, r=radius, return_distance=False, count_only=False)

(under a second)

And get the maximum of the values you are interested in:

StoredOCC = [np.max(Final.CUM_OCC[i]) for i in StoredOCC_indici]

This automatically produces NaNs for empty index lists, which is nice. It took Wall time: 3.64 s.

For this radius, the whole calculation on the 6 million points took under a minute on my machine.
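As noted in the comments below, to reproduce the 0.03048 km threshold from the question, the radius passed to query_radius has to be expressed on the unit sphere, i.e. the distance in km divided by the earth radius used in the haversine function. A minimal sketch, reusing the tree and arrays defined above:

# express the 0.03048 km threshold in unit-sphere radians,
# as pointed out in the comments (0.03048 km / 6372.8 km)
radius = 0.03048 / 6372.8   # ~4.78e-06

StoredOCC_indici = tree.query_radius(nbi, r=radius,
                                     return_distance=False, count_only=False)
StoredOCC = [np.max(Final.CUM_OCC[i]) for i in StoredOCC_indici]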

– Willem Hendriks

  • That is awesome, to fit OP's original question, the radius should be `radius = 0.03048/6372.8`, but yeah it is pretty fast – Ben.T Mar 26 '21 at 01:09
  • The radius conversion is something to indeed take care of. I saw a factor of earth radius in the haversine function. – Willem Hendriks Mar 26 '21 at 05:51