
I am working with DEAP. I am evaluating a population (currently 50 individuals) against a large dataset (400,000 columns of 200 floating-point values each). I have successfully tested the algorithm without any multiprocessing; execution time is about 40 s/generation. I want to work with larger populations and more generations, so I am trying to speed things up with multiprocessing.

I guess my question is more about multiprocessing than about DEAP. It is not directly about sharing memory/variables between processes; the main issue is how to minimise disk access.

I have started to work with the Python multiprocessing module.

The code looks like this:

import multiprocessing
from datetime import datetime

import pandas as pd
from deap import base, creator

toolbox = base.Toolbox()

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    toolbox.register("map", pool.map)

# module level: the data file is read and unpickled here, outside the __main__ guard
data = pd.read_pickle(PICKLE_DATA).values

And then, a little bit further:

def main():

    NGEN = 10
    CXPB = 0.5
    MUTPB = 0.2

    population = toolbox.population_guess()
    # evaluation of the initial population is dispatched through toolbox.map,
    # i.e. through pool.map once the pool is registered
    fitnesses = list(toolbox.map(toolbox.evaluate, population))
    print(sorted(fitnesses, reverse=True))
    for ind, fit in zip(population, fitnesses):
        ind.fitness.values = fit
    # Begin the evolution
    for g in range(NGEN):
        # ... (rest of the generational loop omitted)

The evaluation function uses the global "data" variable.
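It is roughly of this shape (a simplified, hypothetical stand-in rather than my real code; the only detail that matters here is that it reads the module-level "data" array on every call and returns a one-element tuple, as DEAP requires):

def evaluate(individual):
    # hypothetical scoring: the individual is assumed to hold column indices
    # into the module-level "data" array
    selected = data[:, list(individual)]
    return (float(selected.mean()),)

toolbox.register("evaluate", evaluate)

And, finally: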

if __name__ == "__main__":

    start = datetime.now()
    main()
    pool.close()
    stop = datetime.now()
    delta = stop - start
    print(delta.seconds)

So: the main processing loop and the pool definition are guarded by if __name__ == "__main__":.
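As far as I understand it, on Windows multiprocessing spawns fresh interpreters that re-import this module, so everything at module level (including the read_pickle call) runs again in every worker. A tiny standalone snippet that, I believe, shows this behaviour (work() is just a dummy function):

import multiprocessing
import os

# on Windows this line is printed by the parent and again by each worker process
print("module imported in process", os.getpid())

def work(x):
    return x * x

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(work, range(4)))
    pool.close()
    pool.join()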

It somehow works. Execution times are:

  • 1 process: 398 s
  • 2 processes: 270 s
  • 3 processes: 272 s
  • 4 processes: 511 s

Multiprocessing does not dramatically improve the execution time, and can even harm it.

The (lack of) performance with 4 processes can be explained by memory constraints: my system is basically paging instead of processing.

I guess the other measurements can be explained by the loading of the data.

My questions:

1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read each time one of the functions it contains is called by map?

2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but then I get an error saying that "data" is not defined when I call the evaluation function. Could you explain how I can read the file once and then only pass the array to the processes? (The sketch just below is the kind of thing I am after.)
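For question 2, this is roughly the approach I have in mind but have not managed to get working (untested sketch; it reuses PICKLE_DATA, toolbox and main() from above, and init_worker is just a name I made up): read the pickle once in the parent, then hand the array to each worker through the pool's initializer, so it is transferred once per worker at start-up instead of being read from disk again.

def init_worker(worker_data):
    # runs once in each worker when the pool starts; makes the array
    # visible to evaluate() as the module-level "data" global
    global data
    data = worker_data

if __name__ == "__main__":
    data = pd.read_pickle(PICKLE_DATA).values          # disk is hit exactly once, in the parent
    pool = multiprocessing.Pool(processes=2,
                                initializer=init_worker,
                                initargs=(data,))      # array is sent to each worker once, at start-up
    toolbox.register("map", pool.map)
    main()
    pool.close()

Each worker would still hold its own copy of the array in memory, but the file itself would only be read once.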

  • Possible duplicate of [Python multiprocessing shared memory](http://stackoverflow.com/questions/14124588/python-multiprocessing-shared-memory) – rll Oct 03 '15 at 14:53
  • My understanding is that the way Python moves information and functions to another process is via pickling. Thus each time you send a function / parameters, or return a result, there is pickling involved. The overhead of the pickling can outweigh the benefit unless the nature of the work is such that you can spool up something in another process and let it do some meaningful quantity of work. Otherwise threads are a better bet. – songololo Oct 03 '15 at 15:09
  • Thanks. I'd like first to understand (and solve) the disk access question: I'd rather pass pickled data than access the disk. And, in the longer term, I'd like to distribute the workload over several systems, so multithreading is not a real option. SCOOP is an option, but basically has the same constraints as multiprocessing. – Bruno Hanzen Oct 03 '15 at 18:19

0 Answers