I'm new to parallel programming. My task is to analyze hundreds of data files. Each file is nearly 300MB and can be sliced into numerous slices. My machine is a 4-core PC, and I want to get the result for each file as soon as possible.
The analysis of each data file consists of two steps. First, read the data into memory and slice it into pieces, which is IO-intensive. Then, do lots of computation on the slices of that file, which is CPU-intensive.
So my strategy is to group the files in groups of 4. For each group, first read the data of all 4 files into memory with 4 processes on the 4 cores. The code looks like this:
    from multiprocessing import Pool

    with Pool(processes=4) as pool:
        data_list = pool.map(read_and_slice, files)  # len(files) == 4
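For reference, read_and_slice is roughly like the sketch below; the Data wrapper and the 1 MiB chunk size are just placeholders for my real slicing logic:

    from dataclasses import dataclass

    @dataclass
    class Data:
        slices: list  # list of byte chunks

    def read_and_slice(path):
        # Read one ~300MB file (IO-intensive) and cut it into fixed-size slices.
        chunk = 1024 * 1024  # 1 MiB per slice; placeholder size
        with open(path, 'rb') as f:
            raw = f.read()
        return Data(slices=[raw[i:i + chunk] for i in range(0, len(raw), chunk)])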
Then, for each data in data_list, do the computation work with 4 processes:
    for data in data_list:  # I want the result of each data file asap
        with Pool(processes=4) as pool:
            result_list = pool.map(compute, data.slices)  # analyze each slice of data
        analyze(result_list)  # analyze the per-slice results, e.g. take the average
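Here, compute and analyze are stand-ins for my real functions; their bodies below are made up, but they have this shape:

    def compute(slice_bytes):
        # Stand-in for the CPU-intensive per-slice computation.
        return sum(slice_bytes) / len(slice_bytes)

    def analyze(results):
        # Stand-in for aggregating the per-slice results, e.g. taking the average.
        print(sum(results) / len(results))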
Then I move on to the next group.
The problem is that over the whole run across hundreds of files, the pool is recreated many times. How can I avoid the overhead of recreating pools and processes? Is there any substantial memory overhead in my code? And is there a better way to minimize the total time?
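One idea I considered is hoisting a single pool out of all the loops and reusing it, roughly like this (chunks is a little helper I wrote; I'm not sure this is the right structure):

    from multiprocessing import Pool

    def chunks(seq, n):
        # Yield successive groups of n items from seq.
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    with Pool(processes=4) as pool:  # one pool for the whole run
        for group in chunks(files, 4):
            data_list = pool.map(read_and_slice, group)       # IO phase for the group
            for data in data_list:
                result_list = pool.map(compute, data.slices)  # CPU phase per file
                analyze(result_list)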
Thanks!