
I am submitting jobs to GCP AI Platform as follows:

```bash
gcloud ai-platform jobs submit training "${job_name}" \
    --job-dir="gs://${job_root}/${job_name}" \
    --region=${region} \
    --master-image-uri=${image_uri} \
    --scale-tier=basic_gpu \
    --python-version=3.7 \
    -- \
    --other-args
```
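
For completeness, the shell variables above are set roughly like this (all values below are placeholders, not the ones I actually use):

```bash
# Placeholder values for the variables referenced in the submit command above.
job_name="train_$(date +%Y%m%d_%H%M%S)"            # unique job name
job_root="my-bucket/jobs"                          # GCS path for job output (placeholder)
region="us-central1"                               # any region supported by AI Platform
image_uri="gcr.io/my-project/my-trainer:latest"    # custom training container (placeholder)
```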

But my custom image needs to be run with Docker's `--ipc=host` switch. Is it possible to add it?
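
For reference, this is roughly the invocation my image would need if I ran it with Docker by hand; the question is how to make AI Platform pass the equivalent flag (the image URI and trailing arguments are placeholders):

```bash
# Manual equivalent of what I'd like AI Platform to do: share the host's IPC
# namespace so the container gets enough shared memory for DataLoader workers.
docker run --ipc=host "${image_uri}" --other-args
```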

UPDATE: Never mind the `--ipc=host` switch; as pointed out in the comments, it shouldn't be needed (and it isn't). Now I need to pass `--device /dev/fuse --cap-add SYS_ADMIN` to AI Platform's docker invocation so that I can mount a GCS bucket into the container's file system with `gcsfuse`. Any ideas? Hacking gcloud is acceptable (e.g., by replacing part of gcloud's sources inside the docker image), but I am not sure where the docker command lives in gcloud's source code.
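
For context, this is roughly what works when I control the Docker invocation myself; the bucket name and mount point below are placeholders:

```bash
# Manual equivalent of what I'd need AI Platform to do: expose the FUSE device
# and grant SYS_ADMIN so gcsfuse can mount inside the container.
docker run --device /dev/fuse --cap-add SYS_ADMIN "${image_uri}" --other-args

# Inside the container, the bucket is then mounted with gcsfuse
# ("my-bucket" and /mnt/gcs are placeholders):
mkdir -p /mnt/gcs
gcsfuse --implicit-dirs my-bucket /mnt/gcs
```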

Ziyuan
  • The service has already increased the value for shm-size, so it should just work. Can you give it a try, please? Also, we have official PyTorch support; maybe you can simply use the service's official container images. – Guoqing Xu Sep 16 '20 at 17:12
  • @GuoqingXu Right, it's actually working, but for some reason it's taking hours between the model building and the training starting, so I thought it had halted. As for the image, I am training with [nnUNet](https://github.com/MIC-DKFZ/nnUNet), which requires PyTorch 1.6 and thus CUDA 10.1, and I think the official image only goes up to PyTorch 1.4 and CUDA 10.0? – Ziyuan Sep 16 '20 at 18:16
  • @GuoqingXu but is there a way to set the [switches](https://docs.docker.com/engine/reference/run/#operator-exclusive-options) generally? – Ziyuan Sep 16 '20 at 18:17
  • To look into this further, can you please give a complete example of how you would like this to work? – MrTech Sep 16 '20 at 19:41
  • 1
    We only provide knobs to enable ML training. For the issue you were facing, can you send your job id to cloudml-feedback@google.com please? The delay might be caused by lack of quota (we queue your jobs if you don't have enough quota). Here is the get started doc for pytorch FYI. https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch – Guoqing Xu Sep 17 '20 at 20:45
  • @MrTech A simplified version of the command is attached. As pointed out in my other comments, I actually don't need that switch to make things work, but it would still be good to know how to add general docker switches for `gcloud ai-platform jobs submit training`. – Ziyuan Sep 18 '20 at 14:01
  • @GuoqingXu Never mind the delay, the jobs work nicely now. But currently I need to add `--device /dev/fuse --cap-add SYS_ADMIN` to AI Platform's docker invocation so that I can use `gcsfuse` to mount a GCS bucket onto the image's file system. Do you have any ideas? (original question updated) – Ziyuan Sep 21 '20 at 21:18
  • Can you share the machine configs you are using for the training job? It would also be helpful if you can share your job id and config steps with cloudml-feedback@google.com for further investigation. – rpasricha Sep 22 '20 at 00:30

0 Answers