
I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?

Surprisingly, this piece of code worked previously on my Windows 7 machine with the same version of Python. I have just copied the file over to my Mac laptop.

I also use tqdm so I can monitor the progress; the behavior is the same with or without it.

#This function handles the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd
import tqdm

def parallelize_dataframe(df, func):
    num_partitions = cpu_count()*2       # number of partitions to split dataframe
    num_cores = cpu_count()              # number of cores on your machine
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split), total=num_partitions)))



#My function that I am applying to the dataframe is in another file
#myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe
from myFuctions import myFunction

#Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF,myFunction)

The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This worked in the past, but now the process seems to hang indefinitely without any error messages.

  • I doubt this is very useful, but take a look at the dask package. I think it will make your work much simpler instead of manually splitting and recombining the dataframe – Anonymous Dodo Jul 24 '19 at 19:38
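A rough sketch of that suggestion, assuming myFunction takes one dataframe partition and returns a dataframe (the npartitions value and the scheduler choice are guesses, not something the comment specifies):

import dask.dataframe as dd
from multiprocessing import cpu_count

# Split the pandas dataframe into partitions and apply myFunction to each one.
ddf = dd.from_pandas(myDF, npartitions=cpu_count() * 2)
# Depending on what myFunction returns, you may need to pass meta=... to map_partitions.
finalDF = ddf.map_partitions(myFunction).compute(scheduler="processes")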

1 Answer


Q: Can someone point out what I am doing wrong here?

You simply expected macOS to use the same mechanism for process instantiation that Windows did in the past.

The multiprocessing module does not do the same set of things on every supported O/S. Its documentation reports the fork start method as unsafe on macOS, and the default start method on macOS was changed from fork to spawn in Python 3.8, so code that silently relied on the old fork-based behaviour can start to hang instead of failing loudly.
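As a minimal sketch, you can check which start method the interpreter will actually use and pin it explicitly, rather than relying on the platform default (the choice of "spawn" below is an assumption for illustration, not something the question mandates):

import multiprocessing as mp

# Python 3.8+ on macOS defaults to "spawn"; Linux still defaults to "fork",
# and Windows has always used "spawn".
print(mp.get_start_method())

# Pin the start method once, early, before any Pool is created.
# With "spawn", the worker function must be importable from a .py file,
# which myFunction already is.
mp.set_start_method("spawn", force=True)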

Next steps to try to move forward:

  • re-read how to do the explicit setup of the start method and call signatures in the multiprocessing documentation (avoid a hidden dependency of the code's behaviour on "new" default values); the sketch above shows one way to pin this

  • test whether you can avoid the cases where multiprocessing spawns a full copy of the Python-interpreter process as many times as you instruct (memory allocations could soon get devastatingly large if many replicas get instantiated beyond the localhost RAM footprint, just due to a growing number of CPU cores)

  • test whether the "worker" code is actually compute-intensive or rather dominated by the latency of network-remote API calls. In such an IO-latency-dominated case, asyncio/await-decorated tools will help more with latency masking than multiprocessing's rather expensive full-copy concurrency of many Python processes that just sit waiting for remote-API answers (see the asyncio sketch after this list)

  • last but not least, performance-sensitive code runs best outside any mediating ecosystem, such as the interactivity-focused Jupyter notebooks
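If the workload really is dominated by waiting on a remote API, a single-process asyncio sketch could look roughly like this. Everything below is an assumption for illustration: fetch_one, the example URL and the id column are hypothetical stand-ins for whatever myFunction actually calls.

import asyncio
import aiohttp
import pandas as pd

async def fetch_one(session, record_id):
    # Hypothetical endpoint; replace with the API that myFunction really calls.
    url = f"https://api.example.com/items/{record_id}"
    async with session.get(url) as resp:
        return await resp.json()

async def fetch_all(ids):
    # One session, many concurrent requests; the event loop masks the latency.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, i) for i in ids))

# Jupyter already runs an event loop, so use top-level await instead of asyncio.run():
# results = await fetch_all(myDF["id"].tolist())
# finalDF = pd.DataFrame(results)

This keeps everything in one interpreter process, so none of the fork/spawn differences between operating systems come into play.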

user3666197