This should be a simple one.
I have a huge dataset and need to run a simulation multiple times, iterating over this dataset over and over again, read-only. I want to run these simulations in parallel, and since I can't afford to load the dataset in every process (it's over 5GB), I wanted to use Ray's shared-memory object store (I could try multiprocessing as well, but Ray seemed easier and faster).
The code below is basically a copy of the examples I could find on this.
import time
import ray

ray.init()  # start Ray; its local object store is the shared memory in question

def run_simulation_parallel():
    proc_list = []
    list_id = ray.put(huge_list)  # 5GB+ list, every position has a dictionary; stored once
    for i in range(10):
        proc_list.append(simulation.remote(i, list_id))  # launch 10 remote tasks
    results = ray.get(proc_list)
    return results

@ray.remote
def simulation(i, list_id):
    time.sleep(60)  # do nothing, just keep the process alive
    return
When I run the code above, I can see in Task Manager that every new process grows to 5GB+, meaning the whole dataset is being copied into each process.
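To quantify this outside Task Manager, I can also measure memory from inside a task (a sketch only; psutil is an extra dependency and memory_probe is just a name I made up, not part of the original code):

import os
import psutil  # extra dependency, used only for this check
import ray

@ray.remote
def memory_probe(list_id):
    # By the time the task body runs, Ray has already resolved list_id into the
    # actual object, so this reports the worker's RAM with the dataset available.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    return rss_gb

# e.g. alongside the code above:
# print(ray.get([memory_probe.remote(list_id) for _ in range(10)]))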
I've seen people say this is exactly the intended use case for Ray (e.g. Robert Nishihara's answer to "Shared-memory objects in multiprocessing"), so it should be possible, but every example I find looks the same as my code. What am I missing here?
Using Python 3.9, PyCharm, Windows 11.
Edit: I tried replacing the dataset (a list of dictionaries) with a simple array full of ones (roughly the sketch below), and now the worker processes don't consume nearly as much RAM as the main one. Can Ray really store objects that are not arrays in shared memory?
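Roughly, the replacement test looked like this (a sketch; I'm using np.ones here, and the size is only illustrative, much smaller than the real dataset):

import numpy as np
import ray

ray.init()

ones_array = np.ones(100_000_000)  # ~800MB of float64 ones, standing in for the real dataset
array_id = ray.put(ones_array)

@ray.remote
def simulation(i, data):
    # NumPy arrays are deserialized zero-copy: the worker maps the read-only buffer
    # that already lives in Ray's shared-memory object store instead of copying it.
    return data.sum()

print(ray.get([simulation.remote(i, array_id) for i in range(10)]))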