
I was trying to make my code parallel and I ran into something strange that I am not able to explain.

Let me define the context. I have a really heavy computation to do: reading multiple files and performing machine learning analysis on them; a lot of math is involved. My code runs fine on both Windows and Linux when it is sequential, but when I try to use multiprocessing everything breaks. Below is an example, which I developed first on Windows:

from multiprocessing.dummy import Pool as ThreadPool 

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ThreadPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

When I run this code on Windows, the output looks like this:

START
100010001000 1000 1000100010001000      081008120808
08130814
0818
082708171000
1000    
  08290828

END 5036 ms

A messy set of prints, all issued within 125 ms (the 5036 ms total is just the 5-second sleep). The same behavior appears on Linux too. However, I noticed that when I apply this method on Linux and look into 'htop', what I see is a set of threads that are randomly picked for execution, but they never execute in parallel. So, after some Google searches, I came up with this new code:

from multiprocessing import Pool as ProcessPool

def ppp(element):
    window,day = element
    print(window,day)
    time.sleep(5)
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to be able to run this code uncomment the following line and comment the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))

    pool = ProcessPool(8) 
    results = pool.map_async(ppp, processes_args)
    pool.close() 
    pool.join() 
    print('END', current_milli_time()-start_time, 'ms')

As you can see, I only changed the import statement, which now creates a process pool instead of a thread pool. That solves the problem on Linux: in the real scenario I have 8 cores running at 100% with 8 processes in the system, and the output looks like the one above. However, when I run this code on Windows, the whole run takes more than 10 seconds, and I don't get any of the prints from ppp, only the ones from the main.

I really tried to find an explanation, but I do not understand why this happens. For example, here: Python multiprocessing Pool strange behavior in Windows, they talk about code that is safe on Windows, and the answer suggests moving to threading, which, as a side effect, makes the code concurrent instead of parallel. Here is another example: Python multiprocessing linux windows difference. All these questions describe fork() and spawned processes, but I personally think that is not the point of my question. The Python documentation also explains that Windows does not have a fork() method (https://docs.python.org/2/library/multiprocessing.html#programming-guidelines).

In conclusion, right now I am becoming convinced that I cannot do parallel processing on Windows, but I suspect that what I am inferring from all these discussions is wrong. Thus, my question should be: is it possible to run processes or threads in parallel (on different CPUs) on Windows?

EDIT: added if __name__ == '__main__': to both examples.

EDIT2: to be able to run the code, this function and these imports are needed:

import time
import itertools
import pandas as pd
current_milli_time = lambda: int(round(time.time() * 1000))
Guido Muscioni

2 Answers


You can do parallel processing under Windows (I have a script running right now doing heavy computations and using 100% of all 8 cores), but the way it works is by creating parallel processes, not threads (threads won't work, because of the GIL, except for I/O operations). A few important points:

  • you need to use concurrent.futures.ProcessPoolExecutor() (note: the process pool, not the thread pool). See https://docs.python.org/3/library/concurrent.futures.html. In a nutshell, you put the code you want to parallelize in a function and then call executor.map(), which will do the split; see the sketch after this list.
  • note that on Windows each parallel process starts from scratch, so you'll probably need to use if __name__ == '__main__': in a few places to distinguish what you do in the main process from what the others do. The data that you load in the main script will be replicated to the child processes, so it has to be serializable (pickle-able, in Python lingo).
  • in order to use the cores efficiently, avoid writing to objects shared across all processes (e.g. a progress counter or a common data structure); otherwise the synchronization between the processes will kill performance. Monitor the execution from the Task Manager.
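A minimal sketch of that pattern (heavy_computation is a hypothetical stand-in for the real work):

from concurrent.futures import ProcessPoolExecutor

def heavy_computation(x):
    # placeholder for the real file reading / math
    return x * x

if __name__ == '__main__':
    # the guard is mandatory on Windows: child processes re-import this module
    with ProcessPoolExecutor(max_workers=8) as executor:
        # executor.map splits the iterable across the worker processes
        for result in executor.map(heavy_computation, range(16)):
            print(result)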
wishihadabettername

Under Windows, Python uses pickle/unpickle to mimic fork in the multiprocessing module. When unpickling, the module gets re-imported, and any code in the global scope is executed again. The docs state:

Instead one should protect the "entry point" of the program by using if __name__ == '__main__'
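A minimal sketch of that re-import behavior (the worker is a placeholder; get_context('spawn') forces the Windows start method on any OS, so the module-level print appears once per child):

import multiprocessing as mp

print('module-level code: this line runs again in every spawned child')

def work(x):
    return x * x

if __name__ == '__main__':
    # only the main process enters here; children stop after re-importing
    with mp.get_context('spawn').Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))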

Besides, you should consume the AsyncResult returned by pool.map_async, or simply use pool.map.
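For example, reusing the names from the question, a sketch of what consuming the result looks like; .get() blocks until the workers finish and re-raises any exception that occurred in a child, so failures no longer pass silently:

if __name__ == '__main__':
    pool = ProcessPool(8)
    async_result = pool.map_async(ppp, processes_args)
    pool.close()
    pool.join()
    # .get() returns the list of return values, or re-raises a worker error
    print(async_result.get())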

georgexsh