This should be a simple one.
I have a huge dataset and need to run a simulation multiple times, iterating over this dataset over and over again, read-only. I want to run these simulations in parallel, and since I can't afford to load the dataset in every process (it's over 5GB), I wanted to use Ray's shared-memory object store (I could try multiprocessing as well, but Ray seemed easier and faster).
The code below is basically a copy of the examples I could find on this.
import time
import ray

ray.init()  # start Ray; its local object store is the shared memory in question

def run_simulation_parallel():
    proc_list = []
    list_id = ray.put(huge_list)  # 5GB+ list, every position has a dictionary; stored once
    for i in range(10):
        proc_list.append(simulation.remote(i, list_id))  # launch 10 remote tasks
    results = ray.get(proc_list)
    return results

@ray.remote
def simulation(i, list_id):
    time.sleep(60)  # do nothing, just keep the process alive
    return
When I run the code above, I can see in Task Manager that every new process grows to 5GB+, meaning the whole dataset is being copied into each process.
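To quantify this outside Task Manager, I can also measure memory from inside a task (a sketch only; psutil is an extra dependency and memory_probe is just a name I made up, not part of the original code):

import os
import psutil  # extra dependency, used only for this check
import ray

@ray.remote
def memory_probe(list_id):
    # By the time the task body runs, Ray has already resolved list_id into the
    # actual object, so this reports the worker's RAM with the dataset available.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    return rss_gb

# e.g. alongside the code above:
# print(ray.get([memory_probe.remote(list_id) for _ in range(10)]))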
I've seen people say this is exactly the intended use case for Ray (e.g. Robert Nishihara's answer to "Shared-memory objects in multiprocessing"), so it should be possible, but every example I find looks the same as my code. What am I missing here?
Using Python 3.9, PyCharm, Windows 11.
Edit: I tried replacing the dataset (a list of dictionaries) with a simple array full of ones (roughly the sketch below), and now the worker processes don't consume nearly as much RAM as the main one. Can Ray really store objects that are not arrays in shared memory?
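Roughly, the replacement test looked like this (a sketch; I'm using np.ones here, and the size is only illustrative, much smaller than the real dataset):

import numpy as np
import ray

ray.init()

ones_array = np.ones(100_000_000)  # ~800MB of float64 ones, standing in for the real dataset
array_id = ray.put(ones_array)

@ray.remote
def simulation(i, data):
    # NumPy arrays are deserialized zero-copy: the worker maps the read-only buffer
    # that already lives in Ray's shared-memory object store instead of copying it.
    return data.sum()

print(ray.get([simulation.remote(i, array_id) for i in range(10)]))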