
I'm trying to learn about multiprocessing in Python (2.7). My CPU has 4 cores. In the following code I test the speed of parallel vs. serial execution of the same basic instruction.

I find that the time taken using the 4 cores is only about 0.67 times the time taken by a single core, while naively I'd expect ~0.25.

Is overhead the reason? Where does it come from? Aren't the 4 processes independent?

I also tried `pool.map` and `pool.map_async`, with very similar results in terms of speed (I've included a rough sketch of that variant after the code below).

from multiprocessing import Process
import time

def my_process(a):
    # a = (index, number of repetitions); the busy loop is pure CPU work
    for i in range(0,a[1]):
        j=0
        while j<10000:
            j = j+1
    print(a,j)

if __name__ == '__main__':
    # arguments to pass:
    a = ((0,2000),(1,2000),(2,2000),(3,2000))

    # --- 1) parallel processes:
    # 4 cores go up to 100% each here
    t0 = time.time()
    proc1 = Process(target=my_process, args=(a[0],))
    proc2 = Process(target=my_process, args=(a[1],))
    proc3 = Process(target=my_process, args=(a[2],))
    proc4 = Process(target=my_process, args=(a[3],))
    proc1.start(); proc2.start(); proc3.start(); proc4.start()
    proc1.join() ; proc2.join() ; proc3.join() ; proc4.join()
    dt_parallel = time.time()-t0
    print("parallel : " + str(dt_parallel))

    # --- 2) serial process :
    # 1 core only goes up to 100%
    t0 = time.time()
    for k in a:
        my_process(k)
    dt_serial = time.time()-t0
    print("serial : " + str(dt_serial))

    print("t_par / t_ser = " + str(dt_parallel/dt_serial))

EDIT my PC actually has 2 physical cores (2 = 2 cores per socket * 1 socket, from lscpu [thanks @goncalopp]). If I run the above script with only the first 2 processes I get a ratio of 0.62, not that different from the one obtained with 3 or 4 processes. I guess it won't be easy to go much faster than that.

I tested on another PC where lscpu reports CPU(s): 32, Thread(s) per core: 2, Core(s) per socket: 8, Socket(s): 2, and there I get a ratio of 0.34, similar to @dano's.
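
As a side note, in case someone checks this from Python rather than lscpu: as far as I can tell, `multiprocessing.cpu_count()` reports logical CPUs ("cpu threads"), not physical cores, so on my PC it still returns 4:

import multiprocessing

# Counts logical CPUs, like lscpu's "CPU(s)" line,
# not physical cores (Core(s) per socket * Socket(s))
print(multiprocessing.cpu_count())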

Thanks for your help

scrx2
  • Larger inputs should push the ratio closer to 0.25. The unparallelized portion of the code that creates and starts the processes is included in your time. – chepner Oct 29 '14 at 12:28
  • For what it's worth, my 8 core system gets a ratio of 0.26 for this same code. – dano Oct 29 '14 at 14:10
  • Adding another data point: Running the above code verbatim (at the time of this writing) resulted in a ratio near 0.33 the first time I ran it on my 4-core Windows 7 machine, then subsequently about 0.28, with a spike of approximately 60% CPU usage on all four cores during the parallel section. – John Y Oct 29 '14 at 14:58

1 Answer


Yes, this may be related to overhead, including:

  • Creating and starting the processes
  • Passing the function and the arguments over to them
  • Waiting for process termination

If you truly have 4 physical cores on your machine (and not 2 cores with hyperthreading or similar), you should see that the ratio becomes closer to what is expected for larger inputs, as chepner said. If you only have 2 physical cores, you can't get a ratio below 0.5.
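
If you want to see how much of the wall-clock time is overhead rather than computation, you can have each worker time only its own loop and compare that with the time measured around start()/join(). A rough sketch (the worker and its work size are placeholders of mine, not your exact function):

from multiprocessing import Process, Queue
import time

def worker(n, q):
    # pure CPU-bound busy work, similar to my_process in the question
    t0 = time.time()
    for i in range(n):
        k = 0
        while k < 10000:
            k = k + 1
    q.put(time.time() - t0)   # compute time only, excluding process startup

if __name__ == '__main__':
    q = Queue()
    t0 = time.time()
    procs = [Process(target=worker, args=(2000, q)) for _ in range(4)]
    for p in procs:
        p.start()
    compute_times = [q.get() for _ in range(4)]   # drain the queue before joining
    for p in procs:
        p.join()
    wall = time.time() - t0
    print("wall time (incl. start/join): " + str(wall))
    print("max in-worker compute time  : " + str(max(compute_times)))
    print("difference (~overhead)      : " + str(wall - max(compute_times)))

The difference between the two numbers is the cost of creating the processes, shipping the arguments over, and waiting for them to exit; it shrinks relative to the total as the inputs get larger.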

loopbackbee
  • In `/proc/cpuinfo` I see 4 processors listed. `lscpu` gives CPU(s): 4; On-line CPU(s) list: 0-3; Thread(s) per core: 2; Core(s) per socket: 2. Can I assume I have 4 physical cores? – scrx2 Oct 29 '14 at 13:50
  • @fpdx No. Note that `/proc/cpuinfo`, as well as `lscpu`'s `CPU(s)`, count the number of available *execution environments* ("cpu threads"). The number of actual cores in the system is `Core(s) per socket` times `Socket(s)` on `lscpu`. – loopbackbee Oct 29 '14 at 14:07