8

I'm running Ray on EC2, with workers on c5.large instances, which have ~4 GB of RAM.

When I run many jobs, I see these error messages:

  File "python/ray/_raylet.pyx", line 631, in ray._raylet.execute_task
  File "/home/ubuntu/project/env/lib/python3.6/site-packages/ray/memory_monitor.py", line 126, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-43-111 is used (3.47 / 3.65 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
21183   0.21GiB ray::IDLE
21185   0.21GiB ray::IDLE
21222   0.21GiB ray::IDLE
21260   0.21GiB ray::IDLE
21149   0.21GiB ray::IDLE
21298   0.21GiB ray::IDLE
21130   0.21GiB ray::IDLE
21148   0.21GiB ray::IDLE
21225   0.21GiB ray::IDLE
21257   0.21GiB ray::IDLE

In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.

I am running my Ray tasks with `memory=2000*1024*1024` and `max_calls=1`, so there should never be more than 2 task processes on the box at the same time.
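
Roughly, this is how I declare the task (`my_task` is a placeholder name, not my actual function):

  import ray

  ray.init()

  # Reserve ~2000 MiB per task and start a fresh worker process for every call.
  @ray.remote(memory=2000 * 1024 * 1024, max_calls=1)
  def my_task(x):
      return x  # the real task does the actual work

  results = ray.get([my_task.remote(i) for i in range(100)])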

What are these ray::IDLE processes and how can I stop my workers from going OOM?

Using ray 0.8.1

Henry Henrinson

3 Answers

8

ray::IDLE processes are idle workers that Ray keeps in its worker pool (it does this to reduce process startup time). Each of them takes around 0.21 GiB of memory because even an idle worker has to keep a Python interpreter running.

You can probably mitigate the problem in two ways (a rough sketch combining both is below):

1. Set the `num_cpus` argument of `ray.init` to a lower value (say 2-3) so that only 2-3 worker processes are available.
2. Take system memory into account. As you can see, Ray uses memory not only for your tasks but also for its own components such as the raylet and the idle worker processes. If your machine has 4 GB of memory and two of your tasks that each use 2 GB are scheduled onto it, you will hit OOM because those extra processes consume additional memory.
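
For example, a rough sketch of both suggestions (the exact numbers are illustrative and should be tuned to your instance size):

  import ray

  # Cap the number of concurrently executing tasks at 2; num_cpus defaults
  # to the machine's CPU count.
  ray.init(num_cpus=2)

  # A 2 GB-per-task reservation leaves room for only one such task next to
  # Ray's own components on a ~4 GB node, so a smaller reservation is used here.
  @ray.remote(memory=1024 * 1024 * 1024, max_calls=1)
  def work(x):  # placeholder task
      return x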

To avoid memory issues, you can either scale up your cluster (use a bigger machine or multiple machines), or reduce the memory usage of your task.

Sang

0

You can limit the port numbers that workers are allowed to use: for example, `ray start --min-worker-port 10010 --max-worker-port 10011` would only allow two workers. Note that (as of Ray 1.12) `--num-cpus` does not limit the number of ray::IDLE workers.

  • Ray committer here - just wanted to add that it is not recommended to use worker port range for the purpose of limiting how many workers are started, since this can cause unexpected behavior like task delays and deadlock. There is no way to add a hard cap on the number of workers, but `--num-cpus` is the best method of adding a soft cap, and IDLE workers should get GCed by Ray eventually. – Stephanie Wang Sep 14 '22 at 05:31
  • The problem is that ray::IDLE's never go away. I have resorted to `ps -u fred | grep ray::IDLE | grep '00:0[5-9]:..' | awk '{print $1}' | xargs kill -9` because that seems to be the only alternative. – Joshua J. Cogliati Jan 13 '23 at 00:09
  • Hmm if you are seeing that the ray::IDLE processes never go away, this is most likely a bug. – Stephanie Wang Jan 14 '23 at 02:07
-1

Try `ray.init(local_mode=True)` to run Ray in a single process; it solved my low-memory issue.

jazeb007
  • Please note that `local_mode=True` is only meant for debugging purposes and you will lose all the major benefits of Ray when running in this mode (parallel/distributed execution, shared memory, etc.). – Stephanie Wang Sep 14 '22 at 05:32