I'm looking into Ray to distribute many computation tasks that can run in parallel. My Ray code looks like this:

    import ray

    # connect to the cluster through Ray Client (head_ip is defined elsewhere)
    ray.init(f"ray://{head_ip}:10001")

    @ray.remote
    def compute(compute_params):
        # do some long computation
        return some_result

    # ==== driver part
    ray_res = [compute.remote(cp) for cp in many_computations]
    remote_res = ray.get(ray_res)

What is the proper way to stop such a computation?
Suppose each computation might take a couple of hours and, for some reason, the driver code is killed, stopped, or crashes. How can I stop the tasks that are still running on the worker machines? Is there perhaps some special configuration for the workers that lets them detect that the driver is dead?