I've been using Ray to parallelise my code on a remote Linux server. The jobs stop after a while with the following error:
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2021-08-19 08:39:21,246 WARNING worker.py:1189 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: c2ac2060eccbb2f78749315d34dda4c52ed9dbf9f1b576b3 Worker ID: a8bbd438a7e16a4de793a02757a8f917668b6cdc2a69a4573b0b9544 Node ID: 4c0199e71ecf4dc12e46ac03e26a9919308ec31899a0fd1dcc93c063 Worker IP address: 134.58.41.155 Worker port: 38967 Worker PID: 4038515
Digging a bit deeper, I found this in the log files of one of the workers:
*** SIGFPE received at time=1629355161 on cpu 6 ***
(pid=4038515) PC: @ 0x7f7570f6e5d4 (unknown) mpz_manager<>::machine_div()
(pid=4038515) @ 0x7f7f09f77420 (unknown) (unknown)
(pid=4038515) @ 0x7ffc84f0c350 (unknown) (unknown)
(pid=4038515) @ ... and at least 1 more frames
I face the same problem with other parallelisation libraries such as Dask and SCOOP. I've also tried on Google Cloud servers, and the problem remains the same.
Interestingly, when I run the same code with the exact same parallelisation on my local Mac, it runs fine.
Any pointers would be much appreciated!
Thanks