
I am using the Python multiprocessing module and am looking for a way to attach read-only data to a process once, when it is constructed. I want this data to persist across multiple jobs.

I planned to subclass Process and attach data to the class, something like this:

import multiprocessing

class Worker(multiprocessing.Process):
    # built once at class definition; each Worker process gets its own copy
    _lotsofdata = LotsOfDataHolder()

    def run(self):
        # do something with self._lotsofdata
        pass

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = Worker()
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()

However, the number of jobs is on the order of 500k, so I would rather use the Pool construct, and I don't see a way to tell Pool to use a subclass of Process.

Is there a way to tell Pool to use a subclass of Process, or is there another way to persist data on a worker across multiple jobs that works with Pool?

Note: There is a long explanation here, but subclassing Process was not specifically discussed.

*I see now that the args are passed to the Process constructor. This makes my approach all the more unlikely.
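For reference, a minimal sketch of the constructor-args pattern I mean (the list here is just a stand-in for my real LotsOfDataHolder):

import multiprocessing

def work(lotsofdata, arg):
    # each Process receives its own copy of the data when it starts
    print(len(lotsofdata), arg)

if __name__ == '__main__':
    data = [0] * 1000  # stand-in for LotsOfDataHolder()
    p = multiprocessing.Process(target=work, args=(data, 42))
    p.start()
    p.join()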


1 Answer


As explained in this answer, multiple processes don't share the same memory space. This makes statements like "persist data on a worker for multiple jobs" meaningless: there's no way for a worker to access any other worker's data.

What multiprocessing can do is copy the same initial data to every worker. This happens auto-magically:

import multiprocessing

# module-level data is copied into each worker when the pool starts
_lotsofdata = [0] * 1000

def run(arg):
    return arg + _lotsofdata[0]

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    l = [1, 2, 3]
    print(pool.map(run, l))

If you don't want to copy memory, you're left to implement your own (OS-dependent) mechanism for sharing state between processes. There are several approaches for that outlined in the linked answer.
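As one portable example, here's a minimal sketch using multiprocessing.Manager, which keeps a single copy of the data in a server process and hands the workers proxies instead of copies; note that every access then pays a round trip to the manager process:

import multiprocessing

def run(shared, arg):
    # 'shared' is a proxy object; reads go through the manager process
    return arg + shared[0]

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        shared = manager.list([0] * 1000)
        with multiprocessing.Pool() as pool:
            print(pool.starmap(run, [(shared, x) for x in [1, 2, 3]]))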

Realistically, unless you're trying to do supercomputations on a cluster with dozens of CPUs, I'd think twice before going down that path.
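That said, if the goal is just one copy of the data per worker (rather than per job), Pool's initializer and initargs parameters can build the data once in each worker process when the pool starts. A minimal sketch of that pattern (the data-building step here is a stand-in):

import multiprocessing

_lotsofdata = None  # populated once per worker by the initializer

def init_worker(size):
    global _lotsofdata
    _lotsofdata = [0] * size  # stand-in for building the expensive data

def run(arg):
    # every job dispatched to this worker reuses the same data
    return arg + _lotsofdata[0]

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8, initializer=init_worker,
                              initargs=(1000,)) as pool:
        print(pool.map(run, range(10)))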

  • Thanks. I understood that the processes/workers have their own address spaces. I don't want the workers to access each other's data. I want to attach data to each worker in the pool at the outset so this data can persist across jobs. The trick is I want, say, 8 workers to work on 500k jobs, so I only want to make 8 copies of the data. Do I need to define the problem as 8 workers and 8 big jobs? – Toaster Feb 04 '15 at 17:10
  • btw, from the docs: "Pool: A process pool object which controls a pool of worker processes to which jobs can be submitted." So it should be possible to persist data on a worker for multiple jobs when using a Pool. – Toaster Feb 04 '15 at 17:22
  • @Colin What constitutes a "job" is not clearly defined, I suppose. IIRC, if you have a `Pool` of 8 workers and use `Pool.map` or similar functions, the number of copies is the number of workers, not the number of "jobs" - you just need to take some care with the `chunksize` parameter – loopbackbee Feb 04 '15 at 17:51