
I am using Python to develop an app that processes data with the multiprocessing module; the code looks like this:

import multiprocessing

globalData = loadData()  # very large data

def f(v):
    # globalData is only ever read here, never modified
    return someOperation(globalData, v)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    arr = loadArray()  # some big list
    res = pool.map(f, arr)

The problem is that every child process needs the same global data to run the function, so each one loads it again, which takes a long time. What is the best way to share this data among all child processes, given that it is already loaded in the parent?

ammcom
  • Python should leverage the copy-on-write mechanism when you use global variables... Are you modifying your `globalData` object? If so, you might want to use threading instead of multiprocessing – Fred Dec 18 '18 at 11:36
  • No, it is read-only, but it seems it will not copy it if I am using Windows, or am I missing something? – ammcom Dec 18 '18 at 11:43
  • Well... I don't know what to tell you. Read-only data shared between processes shouldn't be copied. Relevant: https://stackoverflow.com/questions/38084401/leveraging-copy-on-write-to-copy-data-to-multiprocessing-pool-worker-process – Fred Dec 18 '18 at 11:46

2 Answers


Multiprocessing on ms-windows works differently than it does on UNIX-like systems.

UNIX-like systems have the fork system call, which makes a copy of the current process. In modern systems with copy-on-write virtual memory management, this is not even a very expensive operation.

This means that global data in the parent process will be shared with the child process, until the child process writes to that page, in which case it will be copied.
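
To make that concrete, here is a minimal sketch of the fork behaviour, assuming a UNIX-like system; loadData() and someOperation() below are trivial stand-ins for the question's placeholder functions, just so the sketch runs as-is:

import multiprocessing
import os

def loadData():
    # Stand-in for the question's expensive load.
    print("loading data in pid", os.getpid())
    return list(range(1000000))

def someOperation(data, v):
    # Stand-in: just read from the shared data.
    return data[v] * 2

globalData = loadData()  # runs once, in the parent

def f(v):
    return someOperation(globalData, v)

if __name__ == '__main__':
    # Request fork explicitly; this raises ValueError on ms-windows.
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool() as pool:
        print(pool.map(f, [1, 2, 3]))
    # "loading data" prints only once: the workers inherit globalData
    # via copy-on-write instead of reloading it.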

The thing is that ms-windows doesn't have fork. It has CreateProcess instead. So on ms-windows, this happens:

The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.

Since the module is re-imported in every child process under this start method, the globalData = loadData() line at module level runs again in each child. So every child process loads the data separately.
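
One common workaround (a sketch under the same stand-in assumptions, not a complete fix) is to stop loading at import time and use a Pool initializer instead. This does not avoid the per-worker load, but it makes the load explicit, runs it exactly once per worker, and spares the parent from loading data it never uses:

import multiprocessing
import os

globalData = None  # deliberately not loaded at import time

def loadData():
    # Stand-in for the question's expensive load.
    print("loading data in pid", os.getpid())
    return list(range(1000000))

def init_worker():
    # Runs once in each worker process, right after it starts.
    global globalData
    globalData = loadData()

def f(v):
    return globalData[v] * 2  # stand-in for someOperation(globalData, v)

if __name__ == '__main__':
    with multiprocessing.Pool(initializer=init_worker) as pool:
        print(pool.map(f, [1, 2, 3]))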

What you could try is to have your processes map the data file with mmap using ACCESS_READ. I would expect that the ms-windows memory subsystem is smart enough to only load the data once when the same file is mapped by multiple processes.
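
A minimal sketch of that idea, assuming the data already lives in a file (data.bin below is a hypothetical filename) and using a trivial byte lookup in place of someOperation():

import mmap
import multiprocessing

mapped = None     # per-process read-only view, created in the initializer
_datafile = None  # kept open for the life of the worker

def init_worker(path):
    global mapped, _datafile
    _datafile = open(path, 'rb')
    # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
    mapped = mmap.mmap(_datafile.fileno(), 0, access=mmap.ACCESS_READ)

def f(i):
    return mapped[i]  # trivial stand-in: read one byte of the shared data

if __name__ == '__main__':
    with multiprocessing.Pool(initializer=init_worker,
                              initargs=('data.bin',)) as pool:
        print(pool.map(f, range(10)))

Because all workers map the same file, the operating system can keep a single copy of the pages in its cache regardless of the number of processes.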

Roland Smith

I am also new to Python, but if I understand your question correctly, it's very easy: in the following script we use 5 workers to compute the squares of the first 10000 numbers.

import multiprocessing

globalData = range(10000)  # very large data

def f(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(5)
    print(pool.map(f, globalData))
Meziane
  • The problem is that globalData has to be loaded from external storage, which takes time; since it already resides in memory in the parent, I want to just copy it to the child processes – ammcom Dec 19 '18 at 19:34
  • If you have to process a huge number of files, look at this: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a If you want to process a single huge file, then look at this https://stackoverflow.com/questions/11196367/processing-single-file-from-multiple-processes or this https://stackoverflow.com/questions/42404292/best-way-to-perform-multiprocessing-on-a-large-file-python or this https://stackoverflow.com/questions/28641059/python-process-a-large-text-file-in-parallel – Meziane Dec 20 '18 at 07:53