0

I'm looking into ray to distribute many computation tasks that could run in parallel. Ray code looks like:

ray.init(f"ray://{head_ip}:10001")

@ray.remote
def compute(compute_params):
    # do some long computation
    return some_result

# ==== driver part
ray_res = [compute.remote(cp) for cp in many_computations]
remote_res=ray.get(ray_res)

What is the proper way to stop such computation?

Suppose every computation might take a couple of hours, and for some reason, the driver code is killed/stopped/crashed, how is that possible to stop the tasks on the worker machines? Maybe to have some special configuration for workers that will understand that driver is dead...?

Julias
  • 5,752
  • 17
  • 59
  • 84

2 Answers2

0

Looking into RAY documentation

Remote functions can be canceled by calling ray.cancel on the returned Object ref. Remote actor functions can be stopped by killing the actor using the ray.kill interface.

In your example it would be something like:

ray.init(f"ray://{head_ip}:10001")

@ray.remote
def compute(compute_params):
    # do some long computation
    return some_result
    

# ==== driver part
ray_res = [compute.remote(cp) for cp in many_computations]

for x in ray_res:
    try:
        remote_res=ray.get(x)
    except TaskCancelledError:
        ....
    
running.t
  • 5,329
  • 3
  • 32
  • 50
  • I'm actually looking for a way to kill a task from driver or driver helper . I can call [ray.cancel.(r) for r in ray_res]. But if my code failed, how can I recreate ray_res to kill long-running tasks? – Julias Oct 18 '21 at 12:33
  • Instead of recreating, you should kill them before program exits. You [can't handle system SIGKILL or SIGSTOP](https://stackoverflow.com/questions/33242630/how-to-handle-os-system-sigkill-signal-inside-python) signal, but you can handle almost any other internal exception [including e.g SIGTERM](https://stackoverflow.com/questions/18499497/how-to-process-sigterm-signal-gracefully) etc. The simplest way is to surround **all** your code with big `try: ... except (BaseException, Exception):` block. – running.t Oct 18 '21 at 12:46
  • in my usecase, the driver is a bash script that is usually killed with kill -9. How can i handle that? – Julias Oct 18 '21 at 12:48
  • @Julias you can use kill -15 instead, that sends a `SIGTERM` signal which can be handled in Python. kill -9 sends a `SIGKILL` which cannot be handled. – Abhishek Divekar Mar 02 '23 at 03:46
0

Tasks are automatically killed when the driver exits in Ray. E.g., Try

Start ray using ray start --head

import ray    
import time
ray.init("auto")

@ray.remote
def f():
    time.sleep(300)

a = f.remote()
# At this point, task should be visible from `ray list tasks` or `ps aux | grep ray::f`
time.sleep(10)

and search ps aux | grep ray:: after the driver exits (all Ray workers are )

Sang
  • 885
  • 5
  • 4