from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
import numpy as np
import time

#creating iterable
testDict = {}
for i in range(1000):
    testDict[i] = np.random.randint(1,10)
    

#default method
stime = time.time()    
newdict = []

for k, v in testDict.items():
    for i in range(1000):
        v = np.tanh(v)
    newdict.append(v)
    
etime = time.time()
print(etime - stime) 
#output: 1.1139910221099854  



#multi processing
stime = time.time()
testresult = []

def f(item):
    x = item[1]
    for i in range(1000):
        x = np.tanh(x)
    return x

def main(testDict):
    with ProcessPoolExecutor(max_workers = 8) as executor:
        futures = [executor.submit(f, item) for item in testDict.items()]
        for future in as_completed(futures):
            testresult.append(future.result())
            
if __name__ == '__main__':
    main(testDict)    

etime = time.time()
print(etime - stime)
#output: 3.4509658813476562

I am learning multiprocessing and testing things out. I ran the test above to check whether I have implemented it correctly. Looking at the times printed, the concurrent method is about 3 times slower. So what's wrong?

My objective is to parallelize a script which mostly operates on a dictionary of around 500 items. Each loop, the values of those 500 items are processed and updated. This loops for, let's say, 5000 generations. None of the k, v pairs interact with other k, v pairs. (It's a genetic algorithm.)

I am also looking for guidance on how to parallelize the objective described above. If I use the correct concurrent.futures method on each of the functions in my genetic algorithm code, where each function takes a dictionary as input and outputs a new dictionary, will it be useful? Any guides/resources/help is appreciated.
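To make the intent concrete, here is a rough sketch of the per-generation pattern I have in mind (process_individual is just a placeholder for one of my functions):

from concurrent.futures import ProcessPoolExecutor

def process_individual(item):
    # placeholder: takes a (key, value) pair and returns the updated pair
    k, v = item
    return k, v

def run_generation(population):
    # population: dict of ~500 individuals; returns a new dict for the next generation
    with ProcessPoolExecutor(max_workers=4) as executor:
        return dict(executor.map(process_individual, population.items()))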

Edit: If I run this example: https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor-example, it takes 3 times longer to solve than a plain for loop check.

  • How many cores are on your machine? Does the trend you observe continue if you change the iterations from 1000 to 10000? 100000? You may just be observing the parallelization overhead by using such a small dataset. Or, if you have fewer than 8 cores, you might just be overloading your CPU. – SethMMorton Sep 21 '20 at 03:47
  • @SethMMorton 4 cores. Ran with 10000 and saw same 3x ratio. Overhead is an avenue of inquiry but if you could please check out the edit I made to my post: even the documentation example is running slower than a loop on a list. – Mashhood Ahmad Sep 21 '20 at 05:25
  • What if you use only 4 workers? You seem to be creating twice as many processes as cores. – SethMMorton Sep 21 '20 at 05:39
  • It worsened for some reason. – Mashhood Ahmad Sep 21 '20 at 06:18

1 Answer


There are a couple of basic problems here: you're using numpy, but you're not vectorizing your calculations. Written this way, your code gets none of numpy's speed benefit; you might as well use the standard library math module, which is faster than numpy for this style of code:

# 0.089sec
import math

newdict = []
for k, v in testDict.items():
    for i in range(1000):
        v = math.tanh(v)
    newdict.append(v)

Only once you vectorise the operation do you see the benefit of numpy:

# 0.016sec
newdict = []
for k, v in testDict.items():
    arr = np.full(1000, v)
    arr2 = np.tanh(arr)
    newdict.append(arr2[-1])

For comparison, your original single-threaded code runs in 1.171sec on my test machine. As you can see, when it's not used properly, numpy can be more than an order of magnitude slower than even pure Python.
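You can go one step further and vectorise across all the dictionary values at once, so the Python-level loop runs 1000 times in total instead of 1000 times per key; a rough sketch:

# rough sketch: one array for all values, tanh applied to the whole array each iteration
vals = np.fromiter(testDict.values(), dtype=np.float64)
for i in range(1000):
    vals = np.tanh(vals)
newdict = list(vals)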

Now on to why you're seeing what you're seeing.

To be honest, I can't replicate your timing results. Your original multiprocessing code runs in 0.299sec for me (macOS, Python 3.6), which is faster than the single-process code. But if I had to guess, you're probably using Windows? On some platforms, like Windows, creating a child process and setting up an environment to run a multiprocessing task is very expensive, so using multiprocessing for a task that lasts less than a few seconds is of dubious benefit. If you are interested in why, read here.

Also, on platforms that don't use fork() by default, like macOS since Python 3.8, or that lack a usable fork() altogether, like Windows, the child process has to re-import the module when you use multiprocessing. So if you put both pieces of code in the same file, each child has to run your single-threaded code before it can run the multiprocessing code. You'll want to put your test code in a function and protect the top-level code with an if __name__ == "__main__": block. On macOS with Python 3.8 or higher, you can also revert to the fork start method by calling multiprocessing.set_start_method("fork"), as long as you're not calling into macOS's non-fork-safe framework libraries.
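A rough sketch of how your test could be restructured, with the data setup and timing moved inside the guarded entry point (the worker function is unchanged):

import time
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np

def f(item):
    x = item[1]
    for i in range(1000):
        x = np.tanh(x)
    return x

def main():
    test_dict = {i: np.random.randint(1, 10) for i in range(1000)}
    results = []
    stime = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(f, item) for item in test_dict.items()]
        for future in as_completed(futures):
            results.append(future.result())
    print(time.time() - stime)

if __name__ == '__main__':
    # Optional, macOS + Python 3.8+ only, and only if nothing you call is fork-unsafe:
    # multiprocessing.set_start_method("fork")
    main()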

With that out of the way, on to your title question.

When you use multiprocessing, you need to copy data to the child process and copy the result back to the main process, and there's a cost to spawning the child processes in the first place. To benefit from multiprocessing, you need to design your workload so that this overhead is negligible compared to the actual work.
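One simple way to keep that overhead down, reusing the f and testDict from your example (just a sketch, meant to live inside your guarded main), is to hand each worker a batch of items rather than submitting one future per item, e.g. with map and a chunksize:

with ProcessPoolExecutor(max_workers=4) as executor:
    # each task carries a batch of 100 items, so the per-task
    # pickling/IPC cost is paid ~10 times instead of 1000 times
    testresult = list(executor.map(f, testDict.items(), chunksize=100))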

If your data comes from an external source, try loading it in the child processes rather than having the main process load it and then transfer it to the children; have the main process tell each child how to fetch its own slice of the data. Here you're generating testDict in the main process, so if you can, move that generation into the children and parallelize it there instead.
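For instance, instead of shipping the values to the children, you could send only a seed and let each child generate its own value; a sketch using numpy's default_rng, not a drop-in replacement for your test:

def work(seed):
    # each child builds its own input from the seed instead of receiving the data
    rng = np.random.default_rng(seed)
    x = float(rng.integers(1, 10))
    for i in range(1000):
        x = np.tanh(x)
    return x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(work, range(1000), chunksize=100))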

Also, since you're using numpy: if you vectorise your operations properly, numpy releases the GIL while doing the vectorised work, so you may be able to just use multithreading instead. Because numpy doesn't hold the GIL during a vector operation, you can take advantage of multiple threads within a single Python process, and you don't need to fork or copy data to child processes, since threads share memory.
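A rough sketch of that multithreaded variant, reusing your testDict and splitting the values into a few chunks so each thread gets a sizeable vectorised piece of work:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def tanh_chain(arr):
    # np.tanh releases the GIL on the array, so several threads can run this loop in parallel
    for i in range(1000):
        arr = np.tanh(arr)
    return arr

vals = np.fromiter(testDict.values(), dtype=np.float64)
chunks = np.array_split(vals, 4)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = np.concatenate(list(executor.map(tanh_chain, chunks)))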

Lie Ryan
  • I am learning vectorization, hence I have to ask in the context of my project on genetic algorithms. If I represent a gene with a random permutation of an array of length 6, say [0 2 3 4 1 5], and represent my population of, say, 200 individuals as a numpy array with the shape (200, 6), will vectorization help? My fundamental unit is not a single number from a row, say 2, on which I have to apply functions; my fundamental unit is the whole row itself. After a couple of operations, the example row might be changed to [5 4 3 0 1 2]. The same thing will happen with the other 199 rows. – Mashhood Ahmad Sep 22 '20 at 18:52
  • So will vectorization help here and improve compute time? – Mashhood Ahmad Sep 22 '20 at 18:53
  • @MashhoodAhmad: yes, vectorize your code first before playing with multiprocessing. If your problem isn't one that can be easily vectorized, then you probably don't need numpy, and plain Python might work just fine. You probably want to make a separate post to ask how to vectorise your problem; there aren't enough details in your comment to say one way or another. That said, a lot of machine learning and genetic algorithm problems are well suited to the vectorised algorithms numpy is good at, although it's not always obvious how. – Lie Ryan Sep 23 '20 at 12:38