In Python 3.6, I am running multiple processes in parallel, where each process pings a URL and returns a Pandas dataframe. I want to keep running the (2+) processes continually, I have created a minimal representative example as below.
My questions are:
1) My understanding is that since I have different functions, I cannot use Pool.map_async()
and its variants. Is that right? The only examples of these I have seen were repeating the same function, like on this answer.
2) What is the best practice to make this setup to run perpetually? In my code below, I use a while
loop, which I suspect is not suited for this purpose.
3) Is the way I am using the Process
and Manager
optimal? I use multiprocessing.Manager.dict()
as the shared dictionary to return the results form the processes. I saw in a comment on this answer that using a Queue
here would make sense, however the Queue
object has no `.dict()' method. So, I am not sure how that would work.
I would be grateful for any improvements and suggestions with example code.
import numpy as np
import pandas as pd
import multiprocessing
import time
def worker1(name, t , seed, return_dict):
'''worker function'''
print(str(name) + 'is here.')
time.sleep(t)
np.random.seed(seed)
df= pd.DataFrame(np.random.randint(0,1000,8).reshape(2,4), columns=list('ABCD'))
return_dict[name] = [df.columns.tolist()] + df.values.tolist()
def worker2(name, t, seed, return_dict):
'''worker function'''
print(str(name) + 'is here.')
np.random.seed(seed)
time.sleep(t)
df = pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))
return_dict[name] = [df.columns.tolist()] + df.values.tolist()
if __name__ == '__main__':
t=1
while True:
start_time = time.time()
manager = multiprocessing.Manager()
parallel_dict = manager.dict()
seed=np.random.randint(0,1000,1) # send seed to worker to return a diff df
jobs = []
p1 = multiprocessing.Process(target=worker1, args=('name1', t, seed, parallel_dict))
p2 = multiprocessing.Process(target=worker2, args=('name2', t, seed+1, parallel_dict))
jobs.append(p1)
jobs.append(p2)
p1.start()
p2.start()
for proc in jobs:
proc.join()
parallel_end_time = time.time() - start_time
#print(parallel_dict)
df1= pd.DataFrame(parallel_dict['name1'][1:],columns=parallel_dict['name1'][0])
df2 = pd.DataFrame(parallel_dict['name2'][1:], columns=parallel_dict['name2'][0])
merged_df = pd.concat([df1,df2], axis=0)
print(merged_df)