I've been using Ray to parallelise my code on a remote Linux server. The jobs stop after a while with the following error:
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2021-08-19 08:39:21,246 WARNING worker.py:1189 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: c2ac2060eccbb2f78749315d34dda4c52ed9dbf9f1b576b3 Worker ID: a8bbd438a7e16a4de793a02757a8f917668b6cdc2a69a4573b0b9544 Node ID: 4c0199e71ecf4dc12e46ac03e26a9919308ec31899a0fd1dcc93c063 Worker IP address: 134.58.41.155 Worker port: 38967 Worker PID: 4038515
Digging a bit deeper, I found this in the log files of one of the workers:
*** SIGFPE received at time=1629355161 on cpu 6 ***
(pid=4038515) PC: @ 0x7f7570f6e5d4 (unknown) mpz_manager<>::machine_div()
(pid=4038515) @ 0x7f7f09f77420 (unknown) (unknown)
(pid=4038515) @ 0x7ffc84f0c350 (unknown) (unknown)
(pid=4038515) @ ... and at least 1 more frames
I face the same problem with other parallelisation libraries such as Dask and SCOOP. I've also tried on Google Cloud servers, and the problem remains the same.
Interestingly, when I run the same code with the exact same parallelisation on my local Mac, it runs fine.
Any pointers would be much appreciated!
Thanks