I was trying to make my code parallel and I ran into something strange that I am not able to explain.
Let me define the context. I have a really heavy computation to do: reading multiple files and performing machine learning analysis on them, with a lot of math involved. My code runs fine on both Windows and Linux when it is sequential, but when I try to use multiprocessing everything breaks. Below is an example, which I developed first on Windows:
from multiprocessing.dummy import Pool as ThreadPool
import itertools
import time

import pandas as pd

current_milli_time = lambda: int(round(time.time() * 1000))

def ppp(element):
    window, day = element
    print(window, day)
    time.sleep(5)  # stand-in for the heavy computation
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to run this code without the CSV, uncomment the following line and comment out the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))
    pool = ThreadPool(8)
    results = pool.map_async(ppp, processes_args)
    pool.close()
    pool.join()
    print('END', current_milli_time() - start_time, 'ms')
When I run this code on Windows, the output looks like this:
START
100010001000 1000 1000100010001000 081008120808
08130814
0818
082708171000
1000
08290828
END 5036 ms
A messy set of prints, all of them appearing within roughly 125 ms. The behavior is the same on Linux. However, I noticed that when I run this version on Linux and look at 'htop', I see a set of threads that are picked for execution seemingly at random, but they never execute in parallel. As far as I understand, multiprocessing.dummy just wraps the threading module, so in CPython the GIL keeps more than one thread from executing Python bytecode at the same time (see the sketch below).
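A minimal sketch of what I mean (the burn function is a stand-in workload of my own, not my real analysis):

import time
from multiprocessing.dummy import Pool as ThreadPool

def burn(n):
    # pure-Python busy loop: CPU-bound work
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    start = time.time()
    for _ in range(8):
        burn(5000000)
    print('sequential:', time.time() - start, 's')

    start = time.time()
    with ThreadPool(8) as pool:
        pool.map(burn, [5000000] * 8)
    print('8 threads: ', time.time() - start, 's')
    # on CPython both timings come out about the same, because the GIL
    # lets only one thread run Python bytecode at any instant

Thus, after some Google searches, I came up with this new code: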
from multiprocessing import Pool as ProcessPool
import itertools
import time

import pandas as pd

current_milli_time = lambda: int(round(time.time() * 1000))

def ppp(element):
    window, day = element
    print(window, day)
    time.sleep(5)  # stand-in for the heavy computation
    return

if __name__ == '__main__':
    #%% Reading datasets
    print('START')
    start_time = current_milli_time()
    tree = pd.read_csv('datan\\days.csv')
    days = list(tree.columns)
    # to run this code without the CSV, uncomment the following line and comment out the previous two
    # days = ['0808', '0810', '0812', '0813', '0814', '0817', '0818', '0827', '0828', '0829']
    windows = [1000]
    processes_args = list(itertools.product(windows, days))
    pool = ProcessPool(8)
    results = pool.map_async(ppp, processes_args)
    pool.close()
    pool.join()
    print('END', current_milli_time() - start_time, 'ms')
As you can see, the only change is the import statement, which creates a pool of processes instead of a pool of threads. That solves the problem on Linux: in the real scenario I see 8 processors running at 100%, with 8 processes in the system, and the output looks like the one before. However, when I run this code on Windows, the whole run takes more than 10 seconds, and moreover I do not get any of the prints from ppp, only the ones from the main process.
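While debugging I also noticed that map_async hides worker exceptions until .get() is called on the returned AsyncResult; since my code never calls it, a worker could in principle be failing silently. A minimal sketch (the fragile function is a made-up example):

from multiprocessing import Pool

def fragile(x):
    if x == 3:
        raise ValueError('boom in worker')
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    result = pool.map_async(fragile, range(8))
    pool.close()
    pool.join()
    result.get()  # the worker's ValueError is only re-raised here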
I really tried to search for an explanation, but I do not understand why this happens. For example, here: Python multiprocessing Pool strange behavior in Windows, they talk about safe code on Windows, and the answer suggests moving to threading, which, as a side effect, makes the code concurrent rather than parallel. Here is another example: Python multiprocessing linux windows difference. All these questions describe fork() and spawn processes, but I personally think that the point of my question is not that. The Python documentation also explains that Windows does not have a fork() method (https://docs.python.org/2/library/multiprocessing.html#programming-guidelines).
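Note that the linked page is the Python 2 documentation; since Python 3.4 the start method can be selected explicitly, so (if I understand correctly) the Windows-style spawn behavior can even be reproduced on Linux:

import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    # 'spawn' is the only start method available on Windows;
    # forcing it on Linux reproduces the Windows semantics there
    ctx = mp.get_context('spawn')
    with ctx.Pool(4) as pool:
        print(pool.map(work, range(8)))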
In conclusion, right now I am convinced that I cannot do parallel processing on Windows, but I suspect that the conclusion I am drawing from all these discussions is wrong. Thus, my question is: is it possible to run processes or threads in parallel (on different CPUs) on Windows?
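For reference, this is the kind of minimal CPU-bound benchmark (burn is again a stand-in workload of my own) that I would expect to answer it: if the processes really run on different CPUs, the pooled timing should be several times smaller than the sequential one.

import time
from multiprocessing import Pool

def burn(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [5000000] * 8

    start = time.time()
    for n in jobs:
        burn(n)
    print('sequential: ', time.time() - start, 's')

    start = time.time()
    with Pool(8) as pool:
        pool.map(burn, jobs)
    print('8 processes:', time.time() - start, 's')

On an 8-core machine I would expect the second timing to be close to one eighth of the first if the processes truly run in parallel.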
EDIT: added the if __name__ == '__main__': guard to both examples.
EDIT2: to be able to run the code, these imports and this helper function are needed:

import time
import itertools

current_milli_time = lambda: int(round(time.time() * 1000))