I was starting a multiprocessing job that ran out of memory. I could see (online, in a SageMaker container) that it raised an OSError, but the overall process was not terminated. I am unsure what to blame (SageMaker, Docker, multiprocessing itself). What could cause such an error not to propagate upwards?
1 Answer
Make sure your main process exits when a sub-process fails:
When SageMaker runs your container, for example as part of a Training job, it starts the container and waits for the container to exit. It has no knowledge of what your processes are doing.
To have the container exit when one of the sub-processes fails, make sure your main process detects this case and exits.
Note: A container's main running process is the ENTRYPOINT and/or CMD at the end of the Dockerfile. In the case of bring-your-own-script training, it will be your train.py.
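
A minimal sketch of what this can look like in your train.py, assuming a standard `multiprocessing.Pool` (the worker function and its inputs here are placeholders):

```python
import sys
import multiprocessing

def work(item):
    # Placeholder worker; under memory pressure this is where an
    # OSError/MemoryError would be raised.
    return item * 2

def main():
    try:
        with multiprocessing.Pool() as pool:
            # Pool.map re-raises any exception a worker raised, here in
            # the main process, when it collects the results.
            results = pool.map(work, range(8))
        print(results)
    except Exception as exc:
        print(f"Sub-process failed: {exc!r}", file=sys.stderr)
        sys.exit(1)  # a non-zero exit code ends the container and fails the job

if __name__ == "__main__":
    main()
```

One caveat: if the OS kills a worker outright (e.g. the OOM killer sends SIGKILL), `multiprocessing.Pool` may hang waiting for the lost task instead of raising; `concurrent.futures.ProcessPoolExecutor` raises `BrokenProcessPool` in that case, which the same try/except would catch.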

Gili Nachum
- In my case it's the main Python script (in a processing container) that does some multiprocessing via a standard `pool`. In this case you would expect it to be aware, no? – Roelant Dec 07 '21 at 18:01
- The container needs to exit, which will only happen when the main process exits, and that is up to you. – Gili Nachum Dec 13 '21 at 23:42