
The following errors and their solution concern deploying a stack through YAML in Portainer, but they should apply equally to Docker in general.

Environment:

PYTORCH="1.8.0"
CUDA="11.1"
CUDNN="8"
GPUs: GeForce RTX 3090

When trying to train a model with a single GPU, a shared memory size out of bounds error is thrown.

Also, when I used more GPUs (4), I got a different error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

However, if you enable debugging of NCCL, you will notice that at its root, it's actually a shared memory size error.
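
For reference, NCCL's debug output can be enabled through its standard NCCL_DEBUG environment variable; a minimal sketch, assuming the training script is launched from a shell inside the container (the python train.py line is only a placeholder for your own command):

    # enable verbose NCCL logging so the underlying failure is printed
    export NCCL_DEBUG=INFO
    # placeholder for your actual training command
    python train.py

With this set, the shared memory failure shows up in the NCCL log output instead of only the generic "unhandled system error".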


1 Answer


It seems that, by default, the size of the shared memory is limited to 64 MB. The solution, as described in this issue, is therefore to increase the size of the shared memory.
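
You can confirm the current limit from inside the running container; a quick check, assuming the container is named mmaction2 (substitute whatever name Portainer assigns):

    # /dev/shm defaults to 64M unless it has been enlarged
    docker exec mmaction2 df -h /dev/shm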

Hence, the first idea that comes to mind would be to simply define something like shm_size: 9gb in the YAML file of the stack. However, this might not work, as shown for example in this issue.
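
For comparison, when the container is started directly with docker run rather than as a Portainer stack, the shared memory size can be set on the command line; a rough equivalent, assuming the image name something used in the example below:

    # --shm-size is honoured when running the container directly
    docker run --gpus all --shm-size=9g something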

Therefore, in the end, I had to use the following workaround (also described here, but poorly documented):

volumes:
      - data-transfer:/mnt/data_transfer
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 9000000000

For this to work, however, you should make sure that the stack YAML file uses a recent compose file format version (otherwise you might get a syntax error), e.g. 3.7. A complete stack YAML file:

version: '3.7'

services:
  mmaction2:
    shm_size: 256m # doesn't work
    image: something 
    tty: true
    volumes:
      - data-transfer:/mnt/data_transfer
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 9000000000 # ~9 GB

volumes:
  # local
  data-transfer:
    driver: local
    name: data-transfer 
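
Once the stack is deployed, it is worth verifying that the tmpfs mount actually took effect; the same df check as above can be used (9000000000 bytes is reported as roughly 8.4G):

    # /dev/shm should now report ~8.4G instead of the default 64M
    docker exec mmaction2 df -h /dev/shm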