Good Day Peeps,
I currently have two data frames, "Locations" and "Pokestops", both containing a list of coordinates. The goal is to cluster the points from "Pokestops" that lie within 70m of the points in "Locations".
I have written a "brute force" clustering script.
The process is as follows:
- Calculate which "Pokestops" are within 70m of each point in "Locations".
- Add all nearby Pokestops to Locations["Pokestops"] as a list of their index values, e.g. [0, 4, 22].
- If no Pokestops are near a point in "Locations", remove that row from the Locations df.
import geopy.distance

# Walk Locations backwards so rows can be dropped safely while iterating
for i in range(len(locations) - 1, -1, -1):
    array = []
    for f in range(len(pokestops)):
        # Geodesic distance in metres between the two "Coordinates" strings
        if geopy.distance.geodesic(locations.iloc[i, 2], pokestops.iloc[f, 2]).m <= 70:
            array.append(f)
    if len(array) == 0:
        locations.drop([i], inplace=True)
    else:
        locations.iat[i, 3] = array

locations["Length"] = locations["Pokestops"].map(len)
This results in:
Lat Long Coordinates Pokestops Length
2 -33.916432 18.426188 -33.916432,18.4261883 [1] 1
3 -33.916432 18.426287 -33.916432,18.42628745 [1] 1
4 -33.916432 18.426387 -33.916432,18.4263866 [1] 1
5 -33.916432 18.426486 -33.916432,18.42648575 [0, 1] 2
6 -33.916432 18.426585 -33.916432,18.4265849 [0, 1] 2
7 -33.916432 18.426684 -33.916432,18.426684050000002 [0, 1] 2
- Sort by the number of Pokestops within 70m, from most to least.
locations.sort_values("Length", ascending=False, inplace=True)
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
149 -33.915341 18.426585 -33.915341350000034,18.4265849 [1, 2, 3, 4] 4
110 -33.915639 18.426585 -33.915638800000025,18.4265849 [1, 2, 3, 4] 4
111 -33.915639 18.426684 -33.915638800000025,18.426684050000002 [1, 2, 3, 4] 4
- Remove every index value listed in the first row's Locations["Pokestops"] from the "Pokestops" lists of all other rows.
stops = list(locations['Pokestops'])
seen = list(locations.iloc[0, 3])
stops_filtered = [seen]
for xx in stops[1:]:
    xx = [x for x in xx if x not in seen]
    stops_filtered.append(xx)
locations['Pokestops'] = stops_filtered
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
149 -33.915341 18.426585 -33.915341350000034,18.4265849 [] 4
110 -33.915639 18.426585 -33.915638800000025,18.4265849 [] 4
111 -33.915639 18.426684 -33.915638800000025,18.426684050000002 [] 4
- Remove all rows whose Locations["Pokestops"] list is now empty.
locations = locations[locations['Pokestops'].map(len) > 0]
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
176 -33.915143 18.426684 -33.91514305000004,18.426684050000002 [5] 3
180 -33.915143 18.427081 -33.91514305000004,18.427080650000004 [5] 3
179 -33.915143 18.426982 -33.91514305000004,18.426981500000004 [5] 3
- Append the first row's coordinates to an array (saved to .txt later), which builds up the final list of "clustered" coordinates.
clusters = np.append(clusters, locations.iloc[0 , 0:2])
This results in:
Lat Long Coordinates Pokestops Length
176 -33.915143 18.426684 -33.91514305000004,18.426684050000002 [5] 3
180 -33.915143 18.427081 -33.91514305000004,18.427080650000004 [5] 3
179 -33.915143 18.426982 -33.91514305000004,18.426981500000004 [5] 3
64 -33.916035 18.427180 -33.91603540000001,18.427179800000005 [0] 3
- Repeat steps 4-7 until the Locations df is empty.
The end result is an array with the coordinates of every point in Locations that has at least one Pokestop within 70m, sorted from largest to smallest cluster.
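Taken together, steps 4-7 form a greedy loop. A minimal sketch of that loop on a made-up toy frame (the coordinates and stop lists below are invented for illustration, not taken from the real data):

```python
import pandas as pd

# Toy stand-in for the post-step-3 Locations table (values invented)
locations = pd.DataFrame({
    "Lat":       [-33.9154, -33.9156, -33.9160],
    "Long":      [ 18.4266,  18.4267,  18.4272],
    "Pokestops": [[1, 2, 3, 4], [1, 2], [0]],
})
clusters = []

while len(locations) > 0:
    # Step 4: biggest remaining cluster first
    locations = (locations.assign(Length=locations["Pokestops"].map(len))
                          .sort_values("Length", ascending=False))
    top = locations.iloc[0]
    clusters.append((top["Lat"], top["Long"]))    # step 7: record the winner
    seen = set(top["Pokestops"])
    # Step 5: strip the claimed stops from every other row
    locations = locations.iloc[1:].copy()
    locations["Pokestops"] = [[s for s in stops if s not in seen]
                              for stops in locations["Pokestops"]]
    # Step 6: drop rows left with no stops
    locations = locations[locations["Pokestops"].map(len) > 0]
```

On the toy data this picks the 4-stop row first, empties the [1, 2] row (all its stops are claimed), and then picks the [0] row on the second pass.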
Now for the actual question.
The method I am using in steps 1-3 needs to loop a few million times even for a small-to-medium dataset.
I believe I can get faster times by moving away from the "for" loops and letting Pandas calculate the distances between the two tables directly, using the geopy.distance.geodesic function.
I am just unsure how to even approach this...
- How do I get it to iterate through rows without using a for loop?
- How do I maintain using my "lists/arrays" in my locations["Pokestops"] column?
- Will it even be faster?
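For reference, one common way to drop the Python-level loops entirely is to compute the whole Locations x Pokestops distance matrix at once with NumPy broadcasting. The sketch below swaps geopy's geodesic for the haversine formula (slightly less accurate, but vectorizable); the sample frames are invented stand-ins:

```python
import numpy as np
import pandas as pd

def haversine_matrix(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Pairwise great-circle distances in metres, via NumPy broadcasting."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2[None, :] - lat1[:, None]          # shape (n_locations, n_stops)
    dlon = lon2[None, :] - lon1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))

# Invented sample frames standing in for "Locations" and "Pokestops"
locations = pd.DataFrame({"Lat": [-33.916432, -33.915441],
                          "Long": [18.426188, 18.426585]})
pokestops = pd.DataFrame({"Lat": [-33.916432, -33.915441],
                          "Long": [18.426486, 18.426585]})

dist = haversine_matrix(locations["Lat"].to_numpy(), locations["Long"].to_numpy(),
                        pokestops["Lat"].to_numpy(), pokestops["Long"].to_numpy())

# Rebuild the list-of-indices column from one boolean matrix comparison,
# so the list/array column survives the vectorisation
locations["Pokestops"] = [list(np.flatnonzero(row)) for row in dist <= 70]
locations = locations[locations["Pokestops"].map(len) > 0]
```

The inner maths runs in C inside NumPy, so this is typically orders of magnitude faster than per-pair geodesic calls, and the list column is rebuilt from the matrix in a single pass.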
I know there is a library called GeoPandas, but I understand it requires conda, and it would mean stepping away from the lists I keep in the Locations["Pokestops"] column. (I also have zero knowledge of how to use GeoPandas, to be fair.)
I know very broad questions like this are generally shunned, but I am fully self-taught in Python, trying to build what is most likely too complicated a script for my level.
I've made it this far; I just need this last step to make it more efficient. The script is fully working and produces the required results, it simply takes too long to run due to the nested for loops.
Any advice/ideas are greatly appreciated. Keep in mind that my knowledge of Python/Pandas is somewhat limited and I do not know all the functions/terminology.
EDIT #1:
Thank you @Finn. Although this solution made me significantly alter my main body, it is working as intended.
With the new distance matrix, I am setting everything > 0.07 (km) to NaN.
Lat Long Count 0 1 2 3 4
82 -33.904620 18.402612 5 NaN NaN NaN 0.052401 NaN
75 -33.904620 18.400183 5 NaN NaN NaN NaN 0.053687
120 -33.903579 18.401224 5 NaN NaN NaN NaN NaN
68 -33.904967 18.402612 5 NaN 0.044402 NaN 0.015147 NaN
147 -33.902885 18.400877 5 NaN NaN NaN NaN NaN
89 -33.904273 18.400183 5 NaN NaN NaN NaN NaN
182 -33.901844 18.398448 4 NaN NaN NaN NaN NaN
54 -33.905314 18.402612 4 NaN 0.020793 NaN 0.026215 NaN
183 -33.901844 18.398795 4 NaN NaN NaN NaN NaN
184 -33.901844 18.399142 4 NaN NaN NaN NaN NaN
The problem I face now is step 5 in my original post.
Can you advise how I would go about removing all columns that do NOT contain NaN in the first row?
The only info I can find covers removing columns if ANY value in any row is not NaN. I have tried every combination of .dropna() I could find online.
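For what it's worth, this particular filter ("keep only the stop columns whose first-row value IS NaN") is easier to express with plain boolean column selection than with .dropna(). A sketch on an invented slice of the matrix shown above:

```python
import numpy as np
import pandas as pd

# Invented slice of the distance matrix from the edit: metadata columns
# plus one column per Pokestop (distance in km, NaN where > 0.07)
df = pd.DataFrame({"Lat":   [-33.904620, -33.904967],
                   "Long":  [ 18.402612,  18.402612],
                   "Count": [5, 5],
                   0: [np.nan, np.nan],
                   1: [np.nan, 0.044402],
                   3: [0.052401, 0.015147]})

meta = ["Lat", "Long", "Count"]
stop_cols = [c for c in df.columns if c not in meta]
# Keep only the stop columns whose first-row value IS NaN,
# i.e. drop every stop already claimed by the top-ranked location
keep = [c for c in stop_cols if pd.isna(df[c].iloc[0])]
df = df[meta + keep]
```

Here column 3 is dropped (its first-row value 0.052401 means that stop belongs to the top location), while columns 0 and 1 survive; the metadata columns are excluded from the check so they are never dropped.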