Ray on slurm - Problems with initialization

Question

I write this post because since I use slurm, I have not been able to use ray correctly. Whenever I use the commands :

ray.init
trainer = A3CTrainer(env = “my_env”) (I have registered my env on tune)

, the program crashes with the following message :

core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The program works fine on my computer, the problem appeared with the use of Slurm. I only ask slurm for one gpu.

Thank you for reading me and maybe answering. Have a great day

Some precisions about the code

@Alex I used the following code :

import ray
from ray.rllib.agents.a3c import A3CTrainer
import tensorflow as tf
from MM1c_queue_env import my_env #my_env is already registered in tune

ray.shutdown()
ray.init(ignore_reinit_error=True)
trainer = A3CTrainer(env = "my_env")

print("success")

Both lines with trainer and init cause the program to crash with the error mentionned in my previous comment. To launch the program with slurm, I use the following program :

#!/bin/bash

#SBATCH --job-name=rl_for_insensitive_policies
#SBATCH --time=0:05:00 
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module load anaconda3/2020.02/gcc-9.2.0
python test.py

Can you post additional details about how you're deploying ray on slurm? — Alex, Jun 01 '22 at 19:33
@Alex I added some precisions in the question. Thank you for answering — Pierre houdouin, Jun 02 '22 at 04:12
Can you add any relevant log information from `/tmp/ray/session_latest/logs` after running that script? Also any network/file system configurations on the slurm cluster that may be relevant? — Alex, Jun 02 '22 at 17:55

score 5 · Answer 1 · answered Jun 03 '22 at 16:41

Limit the number of CPUs

Ray will launch as many worker processes as your execution node has CPUs (or CPU cores). If that's more than you reserved, slurm will start killing processes.

You can limit the number of worker processes as such:

import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
print("success")

score 2 · Answer 2 · answered Jul 21 '22 at 05:14

You can find the detailed instructions of running Ray with SLURM in the documentation. The below instruction is based on it. I used the information in this link too.

You should launch a process for head and launch as many processes as worker nodes you have. Then, the worker nodes must be connected to the head node.

#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 00:05:00 
#SBATCH --job-name= 'rl_for_insensitive_policies'

--tasks-per-node must be one based on the documentation.

#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1

After specifying some resources, load your environment

module load anaconda3/2020.02/gcc-9.2.0

Then, you need to obtain the head ip address.

Getting the node names

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-
address)

if [[ "$head_node_ip" == *" "* ]]; then
  IFS=' ' read -ra ADDR <<<"$head_node_ip"
  if [[ ${#ADDR[0]} -gt 16 ]]; then
    head_node_ip=${ADDR[2]}
  else
    head_node_ip=${ADDR[0]}
  fi
  echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
redis_password=$(uuidgen)
echo "redis_password: "$redis_password

nodeManagerPort=6700
objectManagerPort=6701
rayClientServerPort=10001
redisShardPorts=6702
minWorkerPort=10002
maxWorkerPort=19999

The below code launches the head node.

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" \
        --port=$port \
        --node-manager-port=$nodeManagerPort \
        --object-manager-port=$objectManagerPort \
        --ray-client-server-port=$rayClientServerPort \
        --redis-shard-ports=$redisShardPorts \
        --min-worker-port=$minWorkerPort \
        --max-worker-port=$maxWorkerPort \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &

sleep 10

number of nodes other than the head node

worker_num=$((SLURM_JOB_NUM_NODES - 1))

The below loop launches some workers (one worker for each node).

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &
    sleep 5
done

it is better to add some argeparse arguments to your code so that you can give it the specified resources and the redis-password.

python test.py --redis-password $redis_password --num-cpus 
$SLURM_CPUS_PER_TASK --num-gpus $SLURM_GPUS_PER_TASK

if you get "unable to connect to GCS server" error , use the below values or use some new values. Two users cannot use same port.

port=6380
nodeManagerPort=6800
objectManagerPort=6801
rayClientServerPort=20001
redisShardPorts=6802
minWorkerPort=20002
maxWorkerPort=29999

in your test.py, add the arguments and initialize Ray

import ray
import argparse
parser = argparse.ArgumentParser(description="Script for training RLLIB
agents")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--num-gpus", type=int, default=0)
parser.add_argument("--redis-password", type=str, default=None)
args = parser.parse_args()

ray.init(_redis_password=args.redis_password, address=os.environ["ip_head"])

config["num_gpus"] = args.num_gpus
config["num_workers"] = args.num_cpus

score 0 · Answer 3 · answered Mar 17 '23 at 19:57

0

Same issue with me. Try to bring up a single raylet cluster first ray start --head, then use ray.init(address='auto) might help.

answered Mar 17 '23 at 19:57

Diya Li

1,048
9
21

Ray on slurm - Problems with initialization

Some precisions about the code

3 Answers3

Limit the number of CPUs

Linked