
Good Day Peeps,

I currently have two data frames, "Locations" and "Pokestops", both containing lists of coordinates. The goal is to cluster the points from "Pokestops" that are within 70m of the points in "Locations".

I have created a "Brute Force" clustering script.

The process is as follows:

  1. Calculate which "Pokestops" are within 70m of each point in "Locations".
  2. Add all nearby Pokestops to Locations["Pokestops"] as a list/array of their index values, e.g. [0, 4, 22]
  3. If no Pokestops are near a point in "Locations", remove that row from the Locations df.
import geopy.distance

# walk backwards so dropped rows don't shift the positions still to be visited
for i in range(len(locations) - 1, -1, -1):
    array = []
    for f in range(len(pokestops)):
        # geodesic distance in metres between the two coordinate strings
        if geopy.distance.geodesic(locations.iloc[i, 2], pokestops.iloc[f, 2]).m <= 70:
            array.append(f)
    if len(array) == 0:
        locations.drop([i], inplace=True)
    else:
        locations.iat[i, 3] = array

# recompute the cluster sizes once, after the loop
locations["Length"] = locations["Pokestops"].map(len)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
2   -33.916432  18.426188                   -33.916432,18.4261883           [1]       1
3   -33.916432  18.426287                  -33.916432,18.42628745           [1]       1
4   -33.916432  18.426387                   -33.916432,18.4263866           [1]       1
5   -33.916432  18.426486                  -33.916432,18.42648575        [0, 1]       2
6   -33.916432  18.426585                   -33.916432,18.4265849        [0, 1]       2
7   -33.916432  18.426684           -33.916432,18.426684050000002        [0, 1]       2
  4. Sort by the number of Pokestops within 70m, from most to least.
locations.sort_values("Length", ascending=False, inplace=True)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849  [1, 2, 3, 4]       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849  [1, 2, 3, 4]       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002  [1, 2, 3, 4]       4
  5. Remove all index values listed in Locations[0, "Pokestops"] from all other rows, Locations[1:, "Pokestops"].
stops = list(locations['Pokestops'])
seen = list(locations.iloc[0, 3])
stops_filtered = [seen]
for xx in stops[1:]:
    xx = [x for x in xx if x not in seen]
    stops_filtered.append(xx)
locations['Pokestops'] = stops_filtered

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849            []       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849            []       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002            []       4
  6. Remove all rows with an empty list in Locations["Pokestops"].
locations = locations[locations['Pokestops'].map(len) > 0]

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002           [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004           [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004           [5]       3
  7. Add Locations[0, "Coordinates"] to an array that can be saved to .txt later; this will form the final list of "Clustered" coordinates.
clusters = np.append(clusters, locations.iloc[0, 0:2])

This results in:

           Lat       Long                             Coordinates Pokestops  Length
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002       [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004       [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004       [5]       3
64  -33.916035  18.427180   -33.91603540000001,18.427179800000005       [0]       3
  8. Repeat steps 4-7 until the Locations df is empty.

This all results in an array containing the coordinates of every point from the Locations dataframe that has Pokestops within 70m, sorted from largest to smallest cluster.
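
Put together, the repeat loop is roughly this sketch (simplified; my full script is linked in the comments below):

import numpy as np

clusters = np.array([])

while len(locations) > 0:
    # step 4: densest cluster first
    locations = locations.sort_values("Length", ascending=False)

    # step 7, done early: record the winning point's Lat/Long
    clusters = np.append(clusters, locations.iloc[0, 0:2])

    # step 5: drop the chosen cluster's stops from every remaining row
    seen = set(locations.iloc[0, 3])
    locations = locations.iloc[1:].copy()
    locations["Pokestops"] = [[x for x in xx if x not in seen]
                              for xx in locations["Pokestops"]]

    # step 6: drop rows left with no stops and refresh Length
    locations = locations[locations["Pokestops"].map(len) > 0].copy()
    locations["Length"] = locations["Pokestops"].map(len)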

Now for the actual question.

The method I am using in steps 1-3 results in needing to loop a few million times for a small-to-medium dataset.

I believe I can achieve faster times by migrating away from the "for" loops and letting Pandas calculate the distances between the two tables directly using the geopy.distance.geodesic function.

I am just unsure how to even approach this...

  • How do I get it to iterate through rows without using a for loop?
  • How do I keep using my lists/arrays in the locations["Pokestops"] column?
  • Will it even be faster?

I know there is a library called GeoPandas, but it requires conda, and it would mean stepping away from using my arrays/lists in the Locations["Pokestops"] column. (I also have zero knowledge of how to use GeoPandas, to be fair.)

I know very broad questions like this are generally shunned, but I am fully self-taught in Python, trying to achieve what is most likely too complicated a script for my level.

I've made it this far, I just need this last step to make it more efficient. The script is fully working and provides the required results; it simply takes too long to run due to the nested for loops.

Any advice/ideas are greatly appreciated, and keep in mind my knowledge of Python/Pandas is somewhat limited and I do not know all the functions/terminology.

EDIT #1:

Thank you @Finn; although this solution caused me to significantly alter the main body of my script, it is working as intended.

With the new matrix, I am filtering everything > 0.07 to NaN.
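
Something like this sketch (names are illustrative; distances is the matrix from Finn's answer below):

# wrap the matrix in a DataFrame and NaN out every distance beyond 70m (0.07 km)
dist_df = pd.DataFrame(distances)
masked = dist_df.where(dist_df <= 0.07)

Joined back with the Lat/Long/Count columns, that gives: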

          Lat       Long  Count   0         1   2         3         4
82  -33.904620  18.402612      5 NaN       NaN NaN  0.052401       NaN
75  -33.904620  18.400183      5 NaN       NaN NaN       NaN  0.053687
120 -33.903579  18.401224      5 NaN       NaN NaN       NaN       NaN
68  -33.904967  18.402612      5 NaN  0.044402 NaN  0.015147       NaN
147 -33.902885  18.400877      5 NaN       NaN NaN       NaN       NaN
89  -33.904273  18.400183      5 NaN       NaN NaN       NaN       NaN
182 -33.901844  18.398448      4 NaN       NaN NaN       NaN       NaN
54  -33.905314  18.402612      4 NaN  0.020793 NaN  0.026215       NaN
183 -33.901844  18.398795      4 NaN       NaN NaN       NaN       NaN
184 -33.901844  18.399142      4 NaN       NaN NaN       NaN       NaN

The problem I face now is step 5 in my original post.

Can you advise how I would go about removing all columns that do NOT contain NaN in the 1st row?

The only info I can find is for removing columns if ANY value in any row is not NaN. I have tried every combination of .dropna() I could find online.

Jebula999
  • could you add some data for easier testing please? – Finn Jul 12 '22 at 10:48
  • Not sure on the best way to add the data for testing. But the list of "pokestops" you can use is here: https://pastebin.com/4BgdsbVc and the larger list, which is "Locations", is here: https://pastebin.com/Ss5Sya4g – Jebula999 Jul 12 '22 at 11:01
  • It's a bit messy, but my full script can be found here: https://pastebin.com/4t0bUjke – Jebula999 Jul 12 '22 at 11:10

2 Answers


The apply function might be helpful. DataFrame.apply applies the given function along an axis of the DataFrame, once per row or column (you can control this with its parameters). Check the documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) for further understanding.

I believe explicit loops get chaotic once the solution hides in multiple layers. When playing with datasets, it is far better to avoid loops and use apply-style functions, since they keep the code short and readable.
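
For example, a rough sketch of steps 1-3 from the question using apply (column names as in the question; note that apply still calls the function once per row under the hood, so it mainly tidies the code rather than vectorizing the distance math):

import geopy.distance

def stops_within_70m(row):
    # indices of all pokestops within 70m of this location
    return [i for i, stop in pokestops.iterrows()
            if geopy.distance.geodesic((row["Lat"], row["Long"]),
                                       (stop["Lat"], stop["Long"])).m <= 70]

locations["Pokestops"] = locations.apply(stops_within_70m, axis=1)
locations = locations[locations["Pokestops"].map(len) > 0]
locations["Length"] = locations["Pokestops"].map(len)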

Enos jeba

I don't fully understand your code, but from your text there are a few things you can do to speed this up. I think the most beneficial is to vectorize your distance calculations, because looping over every combination takes forever. So I borrowed the calculation from this answer and adapted it to create a matrix:

import numpy as np
import pandas as pd

pokestops = pd.read_csv('Pokestops.txt', header=None)
pokestops.columns = ["Lat", "Long"]
locations = pd.read_csv('Locations.txt')
locations.columns = ["Lat", "Long"]

def haversine_np_matrix(lon1, lat1, lon2, lat2):
    # vectorized haversine distance: returns a matrix in km, with one row
    # per (lon2, lat2) point and one column per (lon1, lat1) point
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # broadcast: first set of points along columns, second set along rows
    lon1 = np.expand_dims(lon1, axis=0)
    lat1 = np.expand_dims(lat1, axis=0)
    lon2 = np.expand_dims(lon2, axis=1)
    lat2 = np.expand_dims(lat2, axis=1)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

# rows of the matrix are locations, columns are pokestops
distances = haversine_np_matrix(pokestops["Long"], pokestops["Lat"], locations["Long"], locations["Lat"])

This gives you the distance from each Location to each Pokestop. Now you can use something like distances < 0.07 to find all that are closer than 70 meters. For this to work I stripped Locations.txt of everything but Long and Lat. I am not sure what meters = 10 and degrees = 0.000009915 in your script do, so you may have to adapt the calculation, and you may want to compare the 6367 km Earth radius to the calculation from geopy.distance.geodesic as described here to get the same results.
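
For example, a rough sketch (assuming the frames above) of how you could rebuild your Pokestops lists from the matrix:

within_70m = distances < 0.07  # boolean matrix, 0.07 km = 70m
# row i corresponds to locations row i; collect the matching column indices
locations["Pokestops"] = [list(np.flatnonzero(row)) for row in within_70m]
locations["Length"] = locations["Pokestops"].map(len)
locations = locations[locations["Length"] > 0]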

Finn
  • Thank you, the method of creating a Matrix has helped a lot, but in the same turn has completely changed the way I sort information. The last part I am stuck on is the equivalent of step 5 in my original post. I added an Edit to my original post to better show the new df after filtering, are you able to advise on how I can remove all columns that do NOT contain NaN in the first row? – Jebula999 Jul 13 '22 at 15:59
  • You can filter for the rows with NaN and then just select them again, like `only_nan_locations = locations[np.isnan(locations['0'])]`. – Finn Jul 14 '22 at 08:56
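
For the column version asked about in the edit, a similar sketch (assuming the masked matrix is a DataFrame, here called masked, holding only the distance columns) could be:

# keep only the columns whose first row is NaN, i.e. drop every column
# that still has a real distance in row 0
masked = masked.loc[:, masked.iloc[0].isna()]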