I am new to parallel computing and am trying to speed up my code. I have read about several possibilities, for example Dask or TensorFlow, but I struggle to see which method would be most efficient for my goals.
I work with global geospatial data, mainly in GeoPandas. For example, I have two GeoDataFrames: left with 2,000 points and right with 100,000 points. I use a for loop to find the 5 nearest neighbors in right for each point in left, using scikit-learn's BallTree for the k-nearest-neighbors search. Something like:
import pandas as pd

df = pd.DataFrame(index=range(len(left)), columns=range(5))  # empty dataframe to store the neighbors for each point
# loop through each point in left, find the 5 nearest neighbors and store them in df
for n in range(len(left)):
    # balltree_function is my wrapper around the BallTree k=5 query for a single point
    neighbors = balltree_function(left.iloc[n], right)
    df.loc[n] = neighbors
This works, and I can then combine left and df into one DataFrame with all the points and their neighbors, but it takes a long time: on the order of 30 minutes for the sizes above, and for my datasets with millions of points to match it takes hours.
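For completeness, the combine step itself is cheap; it is just an index-aligned join, roughly like this (the neighbor_ column prefix is my own naming, and I assume df shares left's index):

result = left.join(df.add_prefix("neighbor_"))  # attach the 5 neighbor indices to each point in left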
So my question is: what would be an efficient way to parallelise this process? It should be fairly simple to do, since all iterations of the loop are independent. I'm using native miniforge3 on a MacBook Pro M1 Max.
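Concretely, I noticed that tree.query seems to accept a whole array of points in one call, so I was imagining splitting left into chunks and querying the chunks in parallel, something along these lines. This is an untested sketch, assuming scikit-learn's BallTree plus joblib (which scikit-learn already depends on); the chunk and worker counts of 8 are just my guesses for the M1 Max:

import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import BallTree

# build the tree once on right's coordinates (assuming lat/lon point geometries)
tree = BallTree(np.deg2rad(np.c_[right.geometry.y, right.geometry.x]), metric="haversine")
coords_left = np.deg2rad(np.c_[left.geometry.y, left.geometry.x])

# split the query points into chunks and run the queries in parallel;
# joblib pickles the tree and ships a copy to each worker
chunks = np.array_split(coords_left, 8)
results = Parallel(n_jobs=8)(delayed(tree.query)(c, 5) for c in chunks)
idx = np.vstack([ind for _, ind in results])  # (len(left), 5) neighbor indices into right

I picked joblib here only because it already comes with scikit-learn; I don't know whether that, Dask, or plain multiprocessing is the better tool for this.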
Thanks for the help!