
I've been using Ray to parallelise my code on a remote Linux server. The jobs stop after a while with the following error:

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2021-08-19 08:39:21,246 WARNING worker.py:1189 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: c2ac2060eccbb2f78749315d34dda4c52ed9dbf9f1b576b3 Worker ID: a8bbd438a7e16a4de793a02757a8f917668b6cdc2a69a4573b0b9544 Node ID: 4c0199e71ecf4dc12e46ac03e26a9919308ec31899a0fd1dcc93c063 Worker IP address: 134.58.41.155 Worker port: 38967 Worker PID: 4038515
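
For context, the tasks are launched roughly like this (a minimal sketch only; the task body and inputs are placeholders, not my actual code):

import ray

ray.init()

@ray.remote
def run_case(params):
    # placeholder body; the real task does the heavy computation
    return sum(params)

param_list = [[1, 2, 3], [4, 5, 6]]  # placeholder inputs
futures = [run_case.remote(p) for p in param_list]
results = ray.get(futures)
print(results)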

Digging a bit deeper I find this in the log files of one of the workers:

*** SIGFPE received at time=1629355161 on cpu 6 ***
(pid=4038515) PC: @     0x7f7570f6e5d4  (unknown)  mpz_manager<>::machine_div()
(pid=4038515)     @     0x7f7f09f77420  (unknown)  (unknown)
(pid=4038515)     @     0x7ffc84f0c350  (unknown)  (unknown)
(pid=4038515)     @ ... and at least 1 more frames

I face the same problem if I use other parallelisation libraries like Dask or Scoop. I've also tried Google Cloud servers, and the problem remains the same.

Interestingly, when I run the same code with the exact same parallelisation on my local Mac, it runs fine.
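
As a stopgap I've been considering letting Ray retry or skip tasks whose worker dies, roughly along these lines (again just a sketch with placeholder names, and it wouldn't fix whatever is causing the SIGFPE):

import ray
from ray.exceptions import WorkerCrashedError

ray.init()

@ray.remote(max_retries=3)  # let Ray re-run a task whose worker died unexpectedly
def run_case(params):
    return sum(params)  # placeholder body

futures = [run_case.remote([1, 2, 3]), run_case.remote([4, 5, 6])]
results = []
for f in futures:
    try:
        results.append(ray.get(f))
    except WorkerCrashedError:
        results.append(None)  # record the failure instead of aborting the whole run

But that only works around the crash, so I'd still like to understand the root cause.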

Any pointers would be much appreciated!

Thanks
