The problem is that you're using multiprocessing.dummy, i.e. multithreading, not multiprocessing. Multithreading won't make CPU-bound tasks faster; it only helps with I/O-bound tasks.
Try again with multiprocessing.Pool and you should have more success.
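To see the difference, here is a minimal sketch (the cpu_bound helper and the numbers are just illustrative) showing that swapping the thread-backed dummy pool for a real process pool is essentially a one-line change:

import multiprocessing
import multiprocessing.dummy

def cpu_bound(n):
    # pure computation: threads can't run this in parallel because of the GIL
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [100000] * 8
    with multiprocessing.dummy.Pool(4) as tpool:   # 4 threads, effectively one CPU core
        thread_results = tpool.map(cpu_bound, work)
    with multiprocessing.Pool(4) as ppool:         # 4 processes, up to 4 CPU cores
        process_results = ppool.map(cpu_bound, work)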
Also, you need to chunk your input sequence into subsequences so that each process gets enough work to make the overhead worthwhile.
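As a side note, Pool.map itself accepts a chunksize argument, so instead of hand-rolled chunking you can let the pool batch items for you. A rough sketch (using the same square function as below):

from multiprocessing import Pool

def square(x):
    return x*x

if __name__ == '__main__':
    with Pool(4) as pool:
        # chunksize controls how many items are shipped to a worker per task,
        # cutting down the per-item inter-process communication overhead
        results = pool.map(square, range(10000000), chunksize=100000)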
I put this into a solution. Note that the pool has to be created and used only under the if __name__ == '__main__': guard, because multiprocessing starts worker processes that import your module in order to run the mapped function.
import time
from multiprocessing import Pool as ThreadPool  # a real process pool, despite the alias

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split original list in decent sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))
I get the following printout:
--- Parallel map 3.71226 seconds ---
--- Serial map 2.33983 seconds ---
Now the question is: shouldn't the parallel map be faster?
It could be that the whole chunking is costing us efficiency. But it could also be that the interpreter is more "warmed up" when the serial version runs second. So I turned the measurements around:
import time
from multiprocessing import Pool as ThreadPool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def flatten(ll):
    lst = []
    for l in ll:
        lst.extend(l)
    return lst

if __name__ == '__main__':
    start_time = time.time()
    r2 = range(10000000)
    squareChunk(r2)
    print("--- Serial map %g seconds ---" % (time.time() - start_time))

    start_time = time.time()
    r1 = range(10000000)
    nProcesses = 100
    chunked = chunks(r1, int(len(r1)/nProcesses))  # split original list in decent sized chunks
    pool = ThreadPool(4)
    results = flatten(pool.map(squareChunk, chunked))
    pool.close()
    pool.join()
    print("--- Parallel map %g seconds ---" % (time.time() - start_time))
And now I got:
--- Serial map 4.176 seconds ---
--- Parallel map 2.68242 seconds ---
So it's not so clear whether one or the other is faster. But if you want to use multiprocessing, you have to consider whether the overhead of creating the processes is actually much smaller than the speedup you expect. You also run into cache-locality issues, etc.
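A more careful comparison would create the pool once, do an untimed warm-up pass, and average several repetitions of each variant, so that pool start-up and warm-up effects are separated from the actual mapping time. A rough sketch of that idea (reusing the helpers from above; the repetition count is arbitrary):

import time
from multiprocessing import Pool

def square(x):
    return x*x

def squareChunk(chunk):
    return [square(x) for x in chunk]

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

def timed(label, fn, repeats=3):
    # average over several repetitions to smooth out one-off effects
    start = time.time()
    for _ in range(repeats):
        fn()
    print("--- %s: %g seconds per run ---" % (label, (time.time() - start) / repeats))

if __name__ == '__main__':
    data = range(10000000)
    with Pool(4) as pool:
        pool.map(squareChunk, chunks(data, 100000))  # warm-up pass, not timed
        timed("Serial map", lambda: squareChunk(data))
        timed("Parallel map", lambda: pool.map(squareChunk, chunks(data, 100000)))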