
Yesterday I asked a question: Reading data in parallel with multiprocess

I got very good answers, and I implemented the solution mentioned in the answer I marked as correct.

import os
import multiprocessing
import pandas as pd

def read_energies(motif):
    os.chdir("blabla/working_directory")
    complx_ener = pd.DataFrame()
    # complex function to fill that dataframe
    lig_ener = pd.DataFrame()
    # complex function to fill that dataframe
    return motif, complx_ener, lig_ener

COMPLEX_ENERGIS = {}
LIGAND_ENERGIES = {}
p = multiprocessing.Pool(processes=CPU)
for x in p.imap_unordered(read_energies, peptide_kd.keys()):
    COMPLEX_ENERGIS[x[0]] = x[1]
    LIGAND_ENERGIES[x[0]] = x[2]

However, this solution takes about the same amount of time as simply iterating over peptide_kd.keys() and filling up the DataFrames one by one. Why is that? Is there a way to fill up the desired dicts in parallel and actually get a speedup? I am running this on a 48-core HPC.

Gábor Erdős
  • It may be that the overhead of using multiprocessing is greater than that of the complex function processing itself. Perhaps having `read_energies()` process a variable number of dataframes each time would allow you to tune things to the point where it becomes advantageous. – martineau Jul 15 '16 at 10:36
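
As a rough sketch of that suggestion (read_energies, peptide_kd, CPU, and the two dicts are from the question's code above; read_energies_chunk and CHUNK are hypothetical names), each task could handle a batch of motifs so the per-task overhead is amortized:

def read_energies_chunk(motifs):
    # Hypothetical helper: process several motifs per task so the process
    # start-up and pickling overhead is spread over the whole batch.
    return [read_energies(m) for m in motifs]

CHUNK = 10                                   # tuning knob
motifs = list(peptide_kd.keys())
chunks = [motifs[i:i + CHUNK] for i in range(0, len(motifs), CHUNK)]

p = multiprocessing.Pool(processes=CPU)
for batch in p.imap_unordered(read_energies_chunk, chunks):
    for motif, complx_ener, lig_ener in batch:
        COMPLEX_ENERGIS[motif] = complx_ener
        LIGAND_ENERGIES[motif] = lig_ener

Note that Pool.imap_unordered also accepts a chunksize argument, which batches tasks internally and may be the simpler knob to try first.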

1 Answer


You are incurring a good amount of overhead in (1) starting up each process, and (2) having to copy the pandas.DataFrame (etc.) across several processes. If you just need to have a dict filled in parallel, I'd suggest using a shared memory dict. If no key will be overwritten, then it's easy and you don't have to worry about locks.

(Note I'm using multiprocess below, which is a fork of multiprocessing -- but only so I can demonstrate from the interpreter, otherwise, you'd have to do the below from __main__).

>>> from multiprocess import Process, Manager
>>> 
>>> def f(d, x):
...   d[x] = x**2
... 
>>> manager = Manager()
>>> d = manager.dict()
>>> job = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in job]
>>> _ = [p.join() for p in job]
>>> print(d)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

This solution doesn't make copies of the dict to share across processes, so that part of the overhead is reduced. For large objects like a pandas.DataFrame, that saving can be significant compared to the cost of a simple operation like x**2. Similarly, spawning a Process takes time, and you may be able to do the above even faster (for lightweight objects) by using threads (i.e. from multiprocess.dummy instead of multiprocess, for either your originally posted solution or mine above).
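
For example, here is a minimal thread-based sketch of the same toy fill, assuming the f defined above (with threads nothing is pickled, and a plain dict can be shared directly):

>>> from multiprocess.dummy import Process
>>> 
>>> d = {}   # a plain dict works here: threads share the parent's memory
>>> jobs = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in jobs]
>>> _ = [p.join() for p in jobs]
>>> print(d)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}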

If you do need to share DataFrames (as your code suggests, rather than what the question asks), you might be able to do it by creating a shared memory numpy.ndarray.
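
A rough sketch of that idea, here with the standard multiprocessing module (multiprocess works the same way). The shape, the fill_row helper, and the final DataFrame wrapping are hypothetical placeholders, and the sketch assumes a fork-based start method (the Linux default) so the workers inherit the shared buffer:

import multiprocessing as mp
import numpy as np
import pandas as pd

rows, cols = 4, 3                    # hypothetical shape of one energy table
# Lock-free shared buffer of doubles; each worker writes only its own row.
shared = mp.Array('d', rows * cols, lock=False)

def fill_row(i):
    # View the shared buffer as a 2-D ndarray -- no copy is made.
    arr = np.frombuffer(shared, dtype='d').reshape(rows, cols)
    arr[i, :] = i                    # placeholder for the real computation

if __name__ == '__main__':
    jobs = [mp.Process(target=fill_row, args=(i,)) for i in range(rows)]
    _ = [p.start() for p in jobs]
    _ = [p.join() for p in jobs]
    # Build the DataFrame in the parent from the already-filled buffer,
    # so nothing large is returned from the workers.
    complx_ener = pd.DataFrame(np.frombuffer(shared, dtype='d').reshape(rows, cols))
    print(complx_ener)

One such buffer per DataFrame you need (e.g. one for the complex energies and one for the ligand energies) avoids returning DataFrame instances from the workers at all.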

Mike McKerns
  • Thanks for the answer! I am going to try this now, but first I would like to ask something. I don't understand the difference between the mentioned `shared` dataframes (variables, I guess). Why does my code imply that I use a shared DataFrame? The job I want to do in parallel is as you described: fill up a dictionary and use it afterwards in different ways (read the data inside it), but not change anything inside it. – Gábor Erdős Jul 25 '16 at 08:59
  • The reason I said you might look into shared memory arrays is that you are returning two `DataFrame` instances from each `Process`. However, it's hard to point you to whether you need to do that or not, as you only presented meta-code. – Mike McKerns Jul 25 '16 at 10:41
  • Ohh, I see. I need both of the `DataFrames`. Is it problematic to return two of them? Would it be easier to do this in two separate steps? – Gábor Erdős Jul 25 '16 at 12:55
  • No, it's not problematic to return both, it's just expensive. You could, however, make two shared memory arrays that are not returned… and just fill up the `DataFrame.values` accordingly. – Mike McKerns Jul 25 '16 at 16:02
  • I just want to thank you again for this great answer; it took me some time to fully understand it (I am a bioengineer, and I am just learning to code efficiently now), but I learned a lot from this. Sad I can't give you more upvotes :) – Gábor Erdős Sep 09 '16 at 09:54
  • No worries. I'm glad it helped. – Mike McKerns Sep 09 '16 at 14:07
  • Where are you setting the number of processes? – IssamLaradji Mar 29 '18 at 00:19
  • @Curious: Each invocation of `Process` creates a new process, so I'm setting the number of processes with `range(5)`. – Mike McKerns Mar 29 '18 at 04:26