I am working with DEAP. I am evaluating a population (currently 50 individuals) against a large dataset (400,000 columns of 200 floats each). I have successfully tested the algorithm without any multiprocessing; execution time is about 40 s/generation. I want to work with larger populations and more generations, so I am trying to speed things up with multiprocessing.
I guess that my question is more related to multiprocessing than to DEAP. This question is not directly related to sharing memory/variables between processes. The main issue is how to minimise disk access.
I have started to work with the Python multiprocessing module.
The code looks like this:
    import multiprocessing
    from datetime import datetime

    import pandas as pd
    from deap import base, creator

    toolbox = base.Toolbox()
    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
    PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=2)
        toolbox.register("map", pool.map)

    data = pd.read_pickle(PICKLE_DATA).values  # module level: runs again in every new process
And then, a little bit further:
    def main():
        NGEN = 10
        CXPB = 0.5
        MUTPB = 0.2

        population = toolbox.population_guess()
        fitnesses = list(toolbox.map(toolbox.evaluate, population))
        print(sorted(fitnesses, reverse=True))
        for ind, fit in zip(population, fitnesses):
            ind.fitness.values = fit
        # Begin the evolution
        for g in range(NGEN):
            ...
The evaluation function uses the global "data" variable.
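Its shape boils down to something like this sketch (score_against stands in for the real scoring logic, which I have left out here):

    def evaluate(individual):
        # Reads the module-level "data" array directly; it is never passed as an argument.
        return (score_against(individual, data),)  # DEAP expects fitness values as a tuple

    toolbox.register("evaluate", evaluate)

And, finally: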
    if __name__ == "__main__":
        start = datetime.now()
        main()
        pool.close()
        stop = datetime.now()
        delta = stop - start
        print(delta.seconds)
So: the pool definition and the main processing loop are both guarded by if __name__ == "__main__".
It somehow works. Execution times:

- 1 process: 398 s
- 2 processes: 270 s
- 3 processes: 272 s
- 4 processes: 511 s
Multiprocessing does not dramatically improve the execution time, and can even harm it.
The poor performance with 4 processes can be explained by memory constraints: my system is basically paging instead of processing. I guess the other measurements can be explained by the cost of loading the data.
My questions:
1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read again every time one of its functions is called by map? (See the self-check sketch after these questions.)
2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but then I get an error saying that "data" is not defined when the evaluation function is called (the failing variant is sketched below). Could you explain how I can read the file once and then pass only the array to the processes?
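To illustrate question 1, here is a toy script, separate from my actual code, that makes the import behaviour visible. On Windows, each worker is spawned and re-imports the module, so I would expect the module-level print to fire once per worker, not once per map call or per item:

    import multiprocessing
    import os

    # Runs at import time: once in the parent and once in every spawned worker.
    print("module imported in process", os.getpid())

    def square(x):
        return x * x

    if __name__ == "__main__":
        with multiprocessing.Pool(processes=2) as pool:
            print(pool.map(square, range(8)))
            print(pool.map(square, range(8)))  # a second map call triggers no new imports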
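For clarity, the failing variant from question 2 looks like this (the same module as above, with only the read moved under the guard):

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=2)
        toolbox.register("map", pool.map)
        data = pd.read_pickle(PICKLE_DATA).values  # now only the parent process defines "data"

    # Each spawned worker re-imports the module but skips the guarded block,
    # so calling toolbox.evaluate in a worker raises: NameError: name 'data' is not defined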