
I have Python code that uses Ray. It works locally on my Mac, but when I run it inside a local Docker container I get the following:

A warning:

WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm
because /dev/shm has only 67108864 bytes available. This will harm performance! You may
be able to free up space by deleting files in /dev/shm. If you are inside a Docker
container, you can increase /dev/shm size by passing '--shm-size=2.39gb' to 'docker run'
(or add it to the run_options list in a Ray cluster config). Make sure to set this to
more than 30% of available RAM.

After the warning it says: INFO worker.py:1528 -- Started a local Ray instance.

A few seconds later I get this error:

core_worker.cc:179: Failed to register worker
01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError:
[RayletClient] Unable to register worker with raylet. No such file or directory

I already tried:

  1. increasing the size of /dev/shm, as the warning suggests
  2. limiting the number of CPUs in the ray.init() call (as mentioned here)
  3. using ray.init(_plasma_directory = 'dev/shm/')
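
To confirm whether the larger /dev/shm actually took effect inside the container, the mount can be checked from Python before calling ray.init(). This is a minimal sketch using only the standard library, assuming a Linux container; the function name `shm_is_large_enough` is hypothetical, and the 30% threshold is taken from the warning text above, not from Ray's source.

```python
import os
import shutil


def shm_is_large_enough(shm_path="/dev/shm", ram_fraction=0.30):
    """Return (free_bytes, required_bytes, ok) for a shared-memory mount.

    ram_fraction=0.30 mirrors the "more than 30% of available RAM" hint
    in the Ray warning; it is an assumption, not a value read from Ray.
    """
    # Free space on the filesystem that backs shm_path.
    free_bytes = shutil.disk_usage(shm_path).free
    # Total physical RAM visible to this process (POSIX sysconf).
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    required_bytes = int(total_ram * ram_fraction)
    return free_bytes, required_bytes, free_bytes >= required_bytes
```

Calling `shm_is_large_enough()` inside the container and printing the result shows whether `--shm-size` was applied. The 67108864 bytes in the warning is 64 MiB, which is Docker's default /dev/shm size.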

I use Ray version 2.1.0.

The first line of my Dockerfile: `FROM --platform=linux/amd64 python:3.10.9-slim-bullseye` (without the `--platform` I can't pip install ray)

Any ideas what I can do to solve it? Thanks for your help.

HagaiA
  • Do you have any logs from /tmp/ray/session_latest/logs/raylet.out & raylet.err & gcs_server.out & gcs_server.err when the hang happens? – Sang Dec 23 '22 at 00:21
  • Sure. raylet.out: `agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.` – HagaiA Dec 25 '22 at 07:56
  • raylet.err: `(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.` – HagaiA Dec 25 '22 at 08:00
  • dashboard_agent.log: `Raylet is terminated: ip=172.17.0.2, id=e82b6240f65d828ebc69a5a3677a5ef4da07a9fb41a0c321175a6d3e. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals` – HagaiA Dec 25 '22 at 08:00
  • gcs_server.out doesn't show any errors. gcs_server.err is empty – HagaiA Dec 25 '22 at 08:04
  • Looks like the issue is that Ray couldn't even start. What's your grpc version? It should be <= 1.49 – Sang Jan 10 '23 at 03:53
  • I tried setting it to 1.49.1 and to 1.48.0, and the outcome is the same with both. I also tried Python 3.9.16, and that made no difference either – HagaiA Jan 10 '23 at 07:24
  • I see that I didn't mention it yet: I also get the following warning message: `WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested`. Maybe that helps solve this issue – HagaiA Jan 10 '23 at 07:26
  • Ah, Ray currently doesn't have native support for arm64 (I think we are planning to have it from the next release). If you use arm64, you may need to build the wheel on your own: https://discuss.ray.io/t/arm64-support-ci-integration/1893/13 – Sang Jan 12 '23 at 02:11
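
The platform mismatch from the Docker warning above can be made explicit with a small comparison. This is a rough sketch, not part of Ray or Docker: the helper names and the alias table are hypothetical, and the table covers only the two architectures mentioned in this thread.

```python
import platform

# Assumed aliases for the two architectures discussed here (not exhaustive).
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name):
    """Map 'linux/arm64/v8', 'aarch64', 'x86_64', ... to 'amd64' or 'arm64'."""
    parts = name.lower().split("/")
    # Docker platform strings look like os/arch[/variant]; bare names have one part.
    arch = parts[1] if len(parts) > 1 else parts[0]
    return _ARCH_ALIASES.get(arch, arch)


def platforms_match(image_platform, host_platform):
    """True when both platform strings name the same CPU architecture."""
    return normalize_arch(image_platform) == normalize_arch(host_platform)


def running_arch():
    """Architecture reported to this Python process.

    Inside an emulated container this is the emulated architecture,
    not the physical host's.
    """
    return normalize_arch(platform.machine())
```

With the values from the warning, `platforms_match("linux/amd64", "linux/arm64/v8")` is False, meaning the amd64 image runs under emulation on the arm64 host.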
