
I am running a program which first loads 20 GB of data into memory. Then I need to run N (> 1000) independent tasks, each of which may read (but never modify) part of that 20 GB. I am now trying to run those tasks via multiprocessing. However, as this answer says, all global variables are copied into each process. In my case I do not have enough memory to run more than 4 tasks at once, since my machine only has 96 GB. Is there any way to solve this kind of problem so that I can use all my cores without consuming too much memory?

Warren
  • Why are you loading that much data into memory? What are you doing with the data? Ideally, you'd want to be loading data in the smallest possible chunks. – sytech Oct 24 '16 at 15:19
  • @sytech It is what the nature of my task requires. I have considered loading only the data needed by each process, but that would mean loading the same data multiple times. – Warren Oct 24 '16 at 15:22
  • So each process needs to process the whole data, but does its own thing, i.e. no possibility for duplication of effort? In that case, you can try sharing the data in a `Manager.dict()` [here](https://docs.python.org/2/library/multiprocessing.html#managers) or a `Manager.list()` and spawn multiple `Processes` (see the sketch after these comments). If they're all doing the same task then you could chunk the data and hand each process its own chunk. I don't think you can usefully spawn more processes than cores, though, and it seems you want `>1000` processes? – roganjosh Oct 24 '16 at 15:22
  • Are you doing this on Windows? In linux, depending on how you write your code, there is no need for a copy to be sent to the child processes. – tdelaney Oct 24 '16 at 15:29
  • @roganjosh thanks, each task only uses part of the 20 GB data, but which part is essentially random. In that case, is only 20 GB of memory used if I put the data into a `Manager.dict()`? I have 24 threads, so can I work through the 1000 jobs by running 24 processes at a time? – Warren Oct 24 '16 at 15:30
  • Are you able to chunk the 20GB and give each process a chunk, or does each process need the entire 20GB of data to work? Take a look at [this question](http://stackoverflow.com/questions/659865/python-multiprocessing-sharing-a-large-read-only-object-between-processes) which addresses this same kind of problem. – sytech Oct 24 '16 at 15:32
  • @tdelaney I am doing it on a Linux server, and I think all the global variables are copied into each process when I run the tasks via multiprocessing. – Warren Oct 24 '16 at 15:33
  • It allows all child processes access to a single object, yes, rather than giving them all their own copy. – roganjosh Oct 24 '16 at 15:34
  • @whan - no, it's the opposite. The child gets a copy-on-write view of the parent memory space. As long as you load the dataset before firing off the processes and you don't pass a reference to that memory space in the multiprocessing call (that is, workers should use the global variable directly), then there is no copy. – tdelaney Oct 24 '16 at 15:35
  • @sytech it is difficult to chunk the 20 GB data, even though each task only uses a small part of it. – Warren Oct 24 '16 at 15:38
  • @tdelaney but that is only on Linux? I could have sworn this kind of thing failed for me on Windows, but I was also trying to modify a single object, and everything I read indicated that each child got a copy (regardless of OS). `Manager` was the fix for me, whether I needed read or write access. I'll have to go and play with that approach again on Linux. – roganjosh Oct 24 '16 at 15:38
  • It is quite mysterious to me, actually: in the post I mentioned in the question, @senderle says the global variables are copied. However, in my case I get a broken pipe error which I have not found an answer for yet, and I am not 100% sure it is caused by memory copying. If I do a similar thing with 1 GB of data, everything seems to be fine. – Warren Oct 24 '16 at 15:44
  • @roganjosh - in Linux, forked children get a copy-on-write view of the parent. If you modify the data, it won't be seen by the parent. I gave an example in my answer. – tdelaney Oct 24 '16 at 15:55
  • @whan - that linked question was about returning results, which do have to be copied, even in the Linux case. You can step around the pickle problem by carving down the result set to picklable items (see the pickle reference for that). That question also had another mistake at `j = multiprocessing.Process(target=getDV04CclDrivers, args=('LORR', dataDV04))` ... that is, even though `dataDV04` is global, he passed it as a parameter anyway and so it needed to be pickled, even in the Linux case. – tdelaney Oct 24 '16 at 16:31
  • @tdelaney - I think your comment is no longer valid. Even if you pass the object in the parent memory space as an argument in the multiprocessing call, it will still follow copy-on-write semantics. https://stackoverflow.com/questions/67340860/how-are-parent-process-global-variables-copied-to-sub-processes-in-python-multip/67340892?noredirect=1#comment119049166_67340892 – figs_and_nuts May 02 '21 at 11:33
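
The `Manager.dict()` approach suggested in the comments above is not demonstrated in the answer below, so here is a minimal sketch of it. The tiny dict of squares is a made-up stand-in for the 20 GB structure, and the pool size is illustrative; the key point is that only one copy of the dict lives in the manager process, while workers reach it through a proxy.

import multiprocessing as mp

def worker(args):
    """Look up one key through the manager proxy (one IPC round trip per access)."""
    shared, key = args
    return shared[key]

if __name__ == "__main__":
    manager = mp.Manager()
    shared = manager.dict({i: i * i for i in range(100)})  # stand-in for the 20 GB data

    pool = mp.Pool(processes=4)
    tasks = [(shared, key) for key in range(100)]
    for result in pool.imap_unordered(worker, tasks, chunksize=10):
        print(result)
    pool.close()
    pool.join()

Memory stays flat because the data exists only in the manager process, but every lookup is serialized through the proxy, so this is much slower than reading a plain global under the copy-on-write scheme described in the answer below.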

1 Answer


In Linux, forked processes have a copy-on-write view of the parent's address space. Forking is lightweight, and the same program runs in both the parent and the child, except that the child takes a different execution path. As a small example,

import os

var = "unchanged"
pid = os.fork()          # the child gets a copy-on-write view of the parent's memory
if pid:
    # parent branch: pid is the child's process id
    print('parent:', os.getpid(), var)
    os.waitpid(pid, 0)   # wait for the child to exit
else:
    # child branch: pid is 0
    print('child:', os.getpid(), var)
    var = "changed"      # modifies only the child's view

# both parent and child fall through here and show their own view
print(os.getpid(), var)

Results in

parent: 22642 unchanged
child: 22643 unchanged
22643 changed
22642 unchanged

Applying this to multiprocessing: in this example I load the data into a global variable. Since Python pickles the arguments sent to the process pool, I make sure it pickles something small, like an index, and have the worker fetch the global data itself.

import multiprocessing as mp
import os

my_big_data = "well, bigger than this"  # loaded at module level, before the pool exists

def worker(index):
    """Get one char of the big data by reading the global directly."""
    return my_big_data[index]

if __name__ == "__main__":
    pool = mp.Pool(os.cpu_count())
    # only the small index is pickled and sent to each worker
    for c in pool.imap_unordered(worker, range(len(my_big_data)), chunksize=1):
        print(c)

Windows does not have fork, so it cannot give children this copy-on-write view. It has to start a fresh instance of the Python interpreter for each child, re-import your module, and pickle whatever data the child needs. This is a heavy lift!
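
One workaround when you are stuck with the spawn model (Windows, or `spawn` on any OS), which is not part of the answer above: load the data once per worker in a `Pool` initializer so it is not re-pickled for every task. The loader below is a made-up stand-in; note that each worker still ends up with its own copy unless the load itself is a memory map or other shared resource.

import multiprocessing as mp

_big_data = None  # filled in once per worker process by the initializer

def init_worker():
    """Runs once in every child: load (or memory-map) the data here."""
    global _big_data
    _big_data = "well, bigger than this"   # stand-in for the real 20 GB load

def worker(index):
    """Same job as above: read one element from the per-worker data."""
    return _big_data[index]

if __name__ == "__main__":
    n = len("well, bigger than this")
    pool = mp.Pool(processes=4, initializer=init_worker)
    for c in pool.imap_unordered(worker, range(n), chunksize=4):
        print(c)
    pool.close()
    pool.join()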

tdelaney
  • Aha. Ok, the key to me understanding what was going on was the copy-on-write; several things feel contradictory when reading about `mp` if you don't have more in-depth understanding. Thanks for that explanation, upvote. Out of curiosity, I'm still forced to use `Managers` because of Windows; in my case, would that first create another 20GB object (so original + `Manager` = 40GB) in order to initialise the shared resource? The docs suggest that processes access the shared list/dict via a proxy to the _Manager_ but I've never worked with big enough data in `mp` to test. – roganjosh Oct 24 '16 at 16:04
  • thanks @tdelaney, does the conclusion depend on the Python version (2.7 in my case) and on how it is implemented (pool.map, imap, imap_unordered)? – Warren Oct 24 '16 at 16:05
  • @whan - `map` and `imap` both wait for all processing to complete and return results in the order they were submitted. If you don't care about return order, `imap_unordered` is more efficient and will use less memory if the results are large. Multiprocessing is much the same in Python 2.x and 3.x. `map` and `imap_unordered` are both good options, depending on whether you want ordered results. – tdelaney Oct 24 '16 at 16:19
  • @roganjosh - I'm not sure how `Manager` is implemented, but those proxy objects seem like a heavy lift to me! In Windows, if you can use shared memory or memory-mapped files, I think you are better off. But that means things like lists (which are scattered all over memory) are a problem. Philosophically, Windows prefers threads to light-weight subprocesses, but that contradicts the Python GIL model. – tdelaney Oct 24 '16 at 16:21
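
As a rough illustration of the memory-mapped-file idea in that last comment (the file name and the tiny payload are made up for the sketch): if the data can be dumped to a file once, every worker can `mmap` the same file, and the operating system keeps a single physical copy of the pages in its cache, on Windows as well as Linux.

import mmap
import multiprocessing as mp

DATA_FILE = "big_data.bin"   # hypothetical path: the data written to disk once

_file = None
_view = None

def init_worker():
    """Each worker maps the same file; the OS shares the physical pages."""
    global _file, _view
    _file = open(DATA_FILE, "rb")
    _view = mmap.mmap(_file.fileno(), 0, access=mmap.ACCESS_READ)

def worker(offset):
    """Read one byte of the mapped data (stand-in for a real task)."""
    return _view[offset]

if __name__ == "__main__":
    payload = b"well, bigger than this"   # tiny stand-in for the 20 GB
    with open(DATA_FILE, "wb") as f:
        f.write(payload)

    pool = mp.Pool(processes=4, initializer=init_worker)
    for b in pool.imap_unordered(worker, range(len(payload)), chunksize=4):
        print(b)
    pool.close()
    pool.join()

This works best for flat arrays or fixed-layout records; Python dicts and lists of objects are scattered all over memory, which is exactly the problem the comment above points out.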