I am trying to use the Pool class from Python's multiprocessing module to do some data wrangling over a pandas data frame in parallel (see the code under 'Main code' below). The problem is that my code gets stuck and never finishes, no matter how small an input data frame I provide (even one with only 10 rows). I also tried a simple example (see 'Pool example' below), and even that doesn't run.

Here is a detailed description of what I am trying to do: I have an 'indices' data frame with 10 columns and 650K rows. The idea is to take the 10 values in each row of 'indices', look up the rows at those positions in a target data frame 'traindat', and take the mean of a few of its columns over those rows. I have to do this for all 650K rows of 'indices'.

Main code:

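import pandas as pd
from multiprocessing import Pool

# traindat and indices are pandas data frames defined earlier
def func(args):
    # pool.map passes a single argument, so unpack the (row, idx) tuple here
    x, i = args
    dftmp = traindat.iloc[x, 4:28].mean()
    return pd.DataFrame(dftmp).transpose()

pool = Pool(processes=3)
new_rows = pool.map(func, [(row, idx) for idx, row in indices.iterrows()])
pool.close()
pool.join()
data_all_new = pd.concat(new_rows)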

Since this code wouldn't run, I also tried the following simple code to see whether Pool runs at all for me, and it doesn't. Pool example:

import sys
sys.modules['__main__'].__file__ = 'ipython'
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

I don't get any errors. The code simply gets stuck and never finishes. Please help if you understand this issue.

Edit: I later realized the issue only happens on Windows, so I am editing the question to include that.

  • why are you changing the module file? that breaks pickling, also `DataFrame.iterrows` is **really slow**, avoid it if you can! – Sam Mason Jun 29 '19 at 15:16
  • see https://stackoverflow.com/a/54219990/1358308 where I go from ~1 second with `iterrows` to ~2ms by using other things when processing 10k rows – Sam Mason Jun 29 '19 at 15:17
  • Thanks Sam! I edited the module based on my interpretation of this post: [https://stackoverflow.com/questions/34086112/python-multiprocessing-pool-stuck]. Neither the 'main code' nor the 'example code' run even if I don't modify the pool file. I would certainly try what you've mentioned about iterrows but I need to solve the more fundamental problems since even the 'example code' doesn't finish running. I think the problem might be related to the multiprocessing module installation on my system too. – Nagesh Rathi Jul 01 '19 at 15:16
  • might be worth using something other than a jupyter notebook, as that answer points out it kind of breaks things in subtle ways. if you are, I'd suggest editing the question and adding `jupyter-notebook` as a tag. also, I hope you're doing much more than your `func` does, you can do that sort of thing in numpy very efficiently, `np.mean(x[:,4:28], axis=1)[:,None]` where `x` is a 1M by 30 element matrix takes ~25ms for me. – Sam Mason Jul 01 '19 at 17:59
  • Essentially I have to calculate mean of 24 columns of a data frame (traindat) for 10 of its rows at time. I get the index of these '10 rows of traindat' from each row of a data frame that I am calling 'indices' above (it has 10 columns). I have to repeat this operation for all of the 650K rows of 'indices' data frame. – Nagesh Rathi Jul 02 '19 at 17:40
  • Does this answer your question? [python multiprocess don't finish properly](https://stackoverflow.com/questions/13395636/python-multiprocess-dont-finish-properly) – P_Sta May 11 '21 at 20:42
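
The vectorized approach suggested in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `grouped_row_means` and the tiny stand-in arrays are hypothetical, and it assumes the values in 'indices' are positional row numbers (as the use of `iloc` in the question implies). With the real data, the inputs would presumably be something like `traindat.iloc[:, 4:28].to_numpy()` and `indices.to_numpy()`.

```python
import numpy as np


def grouped_row_means(values, idx):
    """Average the rows of `values` selected by each row of `idx`.

    values: 2-D array of shape (n_rows, n_cols)
    idx:    2-D integer array of positional row numbers, shape (n_groups, k)
    returns an array of shape (n_groups, n_cols)
    """
    values = np.asarray(values, dtype=float)
    idx = np.asarray(idx, dtype=np.intp)
    # Fancy indexing gives shape (n_groups, k, n_cols); average over the k axis.
    return values[idx].mean(axis=1)


# Hypothetical stand-in data: 6 rows, 2 columns, 2 groups of 2 row numbers each.
values = np.arange(12, dtype=float).reshape(6, 2)
idx = np.array([[0, 1], [2, 5]])

# Group 0 averages rows 0 and 1, group 1 averages rows 2 and 5.
print(grouped_row_means(values, idx))
```

Note that the intermediate array `values[idx]` holds n_groups × k × n_cols numbers, so for 650K × 10 × 24 values it may be worth processing 'indices' in chunks if memory is tight.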

1 Answer

With the help of a colleague, I later realized this is a duplicate question. Posting a link to the original question and answer in case someone stumbles upon this: Basic parallel python program freezes on Windows

This seems to be an issue with the IDE not being configured properly.
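
For anyone hitting the same hang, here is a minimal sketch of a layout that is generally safe on Windows. On Windows, multiprocessing starts each worker as a fresh interpreter that re-imports the main module, so the worker function should live at the top level of a file and the pool should only be created under the `if __name__ == '__main__':` guard. The names `square` and `run_pool`, and the file name `pool_example.py`, are made up for illustration. The assumption here is that the file is saved to disk and run from a command prompt (`python pool_example.py`) rather than pasted into an IDE console or notebook cell.

```python
from multiprocessing import Pool


def square(x):
    # Defined at module top level so worker processes can import it by name.
    return x * x


def run_pool(values, processes=3):
    # The context manager cleans up the worker processes when the block exits.
    with Pool(processes=processes) as pool:
        return pool.map(square, values)


if __name__ == '__main__':
    # Only the parent process runs this block; workers that re-import the
    # module skip it, which prevents them from creating pools of their own.
    print(run_pool([1, 2, 3]))  # prints [1, 4, 9]
```

If this runs from the command line but the same code still hangs inside the IDE, that would point to the IDE's console setup rather than the multiprocessing installation.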