
I have a pool of workers that all perform the same task, and I send each one a distinct clone of the same data object. I then measure the run time separately for each process, inside the worker function.

With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.

With more complex tasks, this increase is even more pronounced.

There are no other CPU-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.

Why does this happen?

from time import time

def worker_fn(data):
    # Time only the processing step, so fork/pickle overhead is excluded.
    t1 = time()
    data.process()
    print time() - t1
    return data.results


def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads

    pool = Pool(processes=num_procs)
    data = MyClass()
    # Give each task its own deep copy of the same object.
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
    return results
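
A minimal invocation sketch (the value of n is illustrative; the __main__ guard matters on platforms where multiprocessing spawns rather than forks its workers):

# Illustrative driver; n=3 is an arbitrary choice here.
if __name__ == '__main__':
    main(3)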

Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
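
If OpenBLAS really is the culprit, one way to test that hypothesis (a sketch, assuming the slowdown comes from BLAS threads oversubscribing the cores) is to cap each worker's BLAS thread pool to a single thread before numpy is imported:

import os

# These must be set before numpy (and hence the BLAS library) loads.
# The variable names cover the common backends; only OPENBLAS_NUM_THREADS
# is strictly needed if OpenBLAS is confirmed as the backend.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np

With one BLAS thread per worker, the three processes should each run on their own core instead of competing for all of them.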

navidoo
    Is the code a secret? – Amit Kumar Gupta Mar 01 '15 at 02:32
  • Sorry about that, I added (some of) the code. The MyClass code is massive, so I can't post it here. – navidoo Mar 01 '15 at 02:51
    Please provide a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that reproduces the problem you're seeing. – Amit Kumar Gupta Mar 01 '15 at 02:58
    There are lots of reasons why this might happen. For starters, if your threads share CPU cache you're likely to suffer a lot more cache misses, which can cause a big degradation in performance. Even if the workers don't use any shared memory your OS will still need to allocate memory for the copies of the input object, plus any other intermediate variables they need. The more workers competing for system memory, the slower this is likely to be. – ali_m Mar 01 '15 at 03:00
  • Well, @AmitKumarGupta, I can't. Like I said, the code is massive. If I could reduce it to a 10-liner that still reproduced the problem, then I'd probably have already solved the problem. Given that, I'm not asking anyone to debug my code, but to tell me if they know of possible mechanisms that could lead to it, like ali_m's comment. – navidoo Mar 01 '15 at 03:15
    To be honest I think your question is a bit too broad to get a satisfactory answer as it stands. At the very least you should produce a minimal example that reproduces the phenomenon on your machine, but even then these sorts of issues are notoriously hardware and OS-dependent, so it's unlikely that it'll be possible to nail down a specific cause. You'd have a much better chance if your question was about how to optimize a specific piece of code that used multiprocessing. – ali_m Mar 01 '15 at 03:32
  • Is your code actually executing on multiple cores? If not, [have you tried this?](http://stackoverflow.com/a/15641148/1461210) – ali_m Mar 01 '15 at 10:42
  • @ali_m, that didn't solve the problem, but it seems to be in the right direction. Looking at htop's cpu load indicator, I was expecting to see 3 cores at full capacity throughout (pool has 3 processes), but I see cores 1 and 2 at roughly full capacity with 3 and 4 fluctuating. – navidoo Mar 03 '15 at 03:42
  • Are you doing any I/O in your worker processes? You really need to share a complete example, otherwise this is just a guessing game. – ali_m Mar 03 '15 at 12:33

0 Answers