There are many simple tutorials, as well as SO questions and answers, claiming that Ray somehow shares data with the workers, but none of them go into the exact details of what gets shared, how, and on which OS.
For example, in this SO answer: https://stackoverflow.com/a/56287012/1382437 a NumPy array gets serialised into the shared object store and is then used by several workers, all accessing the same data (code copied from that answer):
import numpy as np
import ray

ray.init()

@ray.remote
def worker_func(data, i):
    # Do work. This function will have read-only access to
    # the data array.
    return 0

data = np.zeros(10**7)
# Store the large array in shared memory once so that it can be accessed
# by the worker tasks without creating copies.
data_id = ray.put(data)

# Run worker_func 10 times in parallel. This will not create any copies
# of the array. The tasks will run in separate processes.
result_ids = []
for i in range(10):
    result_ids.append(worker_func.remote(data_id, i))

# Get the results.
results = ray.get(result_ids)
The ray.put(data) call puts the serialised representation of the data into the shared object store and passes back a handle/id for it. Then, when worker_func.remote(data_id, i) is invoked, worker_func gets passed the deserialised data.
But what exactly happens in between? Clearly the data_id is used to locate the serialised version of the data and deserialise it.
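For concreteness, this is a small sketch of the part I do understand (behaviour as I read it from the Ray docs; the printed class name may differ between Ray versions): ray.put returns an ObjectRef handle, and ray.get resolves that handle to a materialised object in whichever process calls it.

import numpy as np
import ray

ray.init()

data = np.zeros(10**7)
data_id = ray.put(data)      # returns a handle (ObjectRef), not the data itself
print(type(data_id))         # e.g. <class 'ray._raylet.ObjectRef'>

# ray.get resolves the handle to a (deserialised) object in the calling process.
local_view = ray.get(data_id)
print(local_view.shape, local_view.dtype)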
Q1: When the data gets "deserialised", does this always create a copy of the original data? I would think yes, but I am not sure.
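One way I could imagine probing this empirically (a sketch only; the NumPy flags are standard, but the values I expect to see are my assumption about how Ray handles NumPy arrays and may depend on the Ray version):

import numpy as np
import ray

ray.init()

data = np.zeros(10**7)
data_id = ray.put(data)
restored = ray.get(data_id)

# If deserialisation made a full copy, the restored array would own its buffer
# and be writeable. My assumption is that a zero-copy view into the object
# store would instead show OWNDATA=False and WRITEABLE=False.
print(restored.flags['OWNDATA'], restored.flags['WRITEABLE'])
print(np.shares_memory(restored, data))  # expected False: ray.put copied `data` into the store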
Once the data has been deserialised, it gets passed to a worker. Now, if the same data needs to get passed to another worker, there are two possibilities:
Q2: When an object that has already been deserialised is passed to a worker, does the worker receive another copy or the exact same object? If it is the exact same object, is this done via the standard shared-memory mechanism for sharing data between processes? On Linux that would mean copy-on-write, so does that mean that as soon as the object is written to, another copy of it is created? (A sketch of how one might test the write behaviour follows below.)
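To make the write part of Q2 concrete, here is a sketch of a worker that tries to mutate the array it receives. My assumption (not verified across OSes) is that arrays backed by the object store are mapped read-only, so the in-place write raises ValueError and an explicit per-process copy is needed:

import numpy as np
import ray

ray.init()

@ray.remote
def try_to_write(arr):
    try:
        arr[0] = 1.0                # attempt to write into the received buffer
        return "wrote in place"
    except ValueError:
        local = arr.copy()          # explicit per-process copy
        local[0] = 1.0
        return "buffer was read-only, had to copy"

data_id = ray.put(np.zeros(10**7))
print(ray.get(try_to_write.remote(data_id)))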
Q3: Some tutorials/answers seem to indicate that the overhead of deserialising and sharing data between workers differs greatly depending on the type of data (NumPy versus non-NumPy), so what are the details there? Why is NumPy data shared more efficiently, and is it still efficient when the worker tries to write to that NumPy array (which, I think, would always create a local copy for the process)?
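As a rough way to see the NumPy vs non-NumPy difference for oneself, one could time a round trip through the object store for an array versus an equivalent Python list (a sketch, not a rigorous benchmark; absolute numbers will vary by machine and Ray version):

import time

import numpy as np
import ray

ray.init()

def roundtrip(obj):
    # Time ray.put plus ray.get for a single object.
    t0 = time.perf_counter()
    ref = ray.put(obj)
    ray.get(ref)
    return time.perf_counter() - t0

arr = np.zeros(10**7)       # ~80 MB of float64
lst = [0.0] * 10**7         # the same values as a plain Python list

print("numpy array:", roundtrip(arr))
print("python list:", roundtrip(lst))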