The following errors and their solution were encountered while deploying a stack through YAML in Portainer, but they apply just as well to plain Docker deployments.
Environment:
PYTORCH="1.8.0"
CUDA="11.1"
CUDNN="8"
GPUs: GeForce RTX 3090
When I tried to train a model on a single GPU, a shared memory size out of bounds error was thrown.
When I used more GPUs (4), I got a different error instead, namely
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
However, if you enable NCCL debugging, you will see that at its root this is also a shared memory size error.
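As a minimal sketch of how NCCL debugging could be enabled in the Portainer stack YAML, the NCCL_DEBUG environment variable can be set on the training service (the service name here is a placeholder, and the image tag is assumed to match the environment above; adjust both to your setup):

version: "3.8"
services:
  trainer:                          # hypothetical service name
    image: pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel
    environment:
      - NCCL_DEBUG=INFO             # NCCL prints detailed diagnostics to the container logs
      - NCCL_DEBUG_SUBSYS=ALL       # optional: include all NCCL subsystems in the output

With this in place, the container logs should show the underlying shared-memory failure hiding behind the generic "unhandled system error".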