I encountered a few related questions on SO but there didn't seem to be a sufficient answer.
I'm trying to access the GPU/CUDA from a Docker container that has all the necessary prerequisites installed. I have not managed to get `torch.cuda.is_available()` to return `True`, despite working through many of the solutions offered elsewhere.
For context, I would like to use Docker for a serverless GPU API setup on RunPod. The image I'm currently testing with is a base NVIDIA CUDA 11.8 image; the same problem arose with a RunPod PyTorch image. I am running the container with `--gpus all`,
and I've also tried the answers in this SO question: I have installed `nvidia-container-runtime`, tested with `nvidia-docker`, and everything does work locally (torch can see CUDA on the host machine).
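In case it helps, this is the sanity check I've been running to confirm the container runtime can expose the GPU at all (the image tag here is just an example; substitute whichever one you use):

```shell
# Verify the NVIDIA Container Toolkit can pass the GPU through to a container.
# If this prints the usual nvidia-smi table, the runtime plumbing is working.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# Same check via the legacy nvidia-docker wrapper, if that is what's installed:
nvidia-docker run --rm nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```

On my host both of these succeed, which is why I suspect the problem is inside the image rather than in the runtime.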
Some observations from inside the Docker container:

- `torch.cuda.is_available()` returns `False`, yet `torch.cuda.device_count() == 1`
- the driver and `nvcc` are both available in the Docker image
- `torch.utils.collect_env` shows CUDA as not available, but reports the CUDA runtime version correctly
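For completeness, here is the small diagnostic script I've been running inside the container to collect these facts in one place. It's just a sketch (it guards against torch being missing); one thing it helps rule out is a CPU-only torch wheel, which reports `torch.version.cuda` as `None`:

```python
import importlib.util


def cuda_report():
    """Gather basic CUDA-visibility facts from torch, if it is installed."""
    report = {}
    if importlib.util.find_spec("torch") is None:
        report["torch_installed"] = False
        return report
    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only build of torch was installed
    report["compiled_cuda_version"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Inside my container this prints `cuda_available: False` alongside a non-`None` compiled CUDA version, which matches the `collect_env` output above.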
I've spent a couple of days trying to debug this. Where else should I be looking, and are there good resources on diagnosing this kind of setup?