Questions tagged [nvidia-smi]

43 questions
11
votes
5 answers

Failed to initialize NVML: Unknown Error in Docker after Few hours

I am having an interesting and weird issue. When I start a Docker container with GPUs, it works fine and I see all the GPUs in Docker. However, a few hours or a few days later, I can't use the GPUs in Docker. When I run nvidia-smi in the Docker machine, I see this…
Justin Song
  • 111
  • 1
  • 4
7
votes
3 answers

Can not find NVIDIA driver after stop and start a deep learning VM

[TL;DR] First, wait for a couple of minutes and check if the Nvidia driver starts to work properly. If not, stop and start the VM instance again. I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the…
zudi
  • 141
  • 1
  • 6
6
votes
0 answers

Given the number of parameters, how to estimate the VRAM needed by a pytorch model?

I am trying to estimate the VRAM needed for a fully connected model without having to build/train the model in pytorch. I got pretty close with this formula: # params = number of parameters # 1 MiB = 1048576 bytes estimate = params * 24 /…
RDlady
  • 378
  • 2
  • 16
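
A rough Python sketch of the kind of back-of-the-envelope estimate this question describes. The function name `estimate_vram_mib` and the `bytes_per_param` parameter are hypothetical, and the multiplier is left to the caller because the excerpt's own formula is cut off; the only value taken from the question is 1 MiB = 1048576 bytes. This is not the asker's formula, only an illustration of the general shape.

```python
MIB = 1048576  # bytes per MiB, as stated in the question


def estimate_vram_mib(params: int, bytes_per_param: float) -> float:
    """Return a crude VRAM estimate in MiB.

    bytes_per_param is an assumption supplied by the caller: 4 covers only
    float32 weights, while training also needs room for gradients,
    optimizer state, and activations.
    """
    return params * bytes_per_param / MIB


if __name__ == "__main__":
    # 10 million parameters stored as float32 weights only.
    print(f"{estimate_vram_mib(10_000_000, 4):.1f} MiB")
```

The figure nvidia-smi reports will usually be higher than any parameter-based estimate, since the CUDA context and the framework's allocator also take memory.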
3
votes
2 answers

What does the command "nvidia-smi --gpu-reset" do?

What does the command sudo nvidia-smi --gpu-reset -i 0 do? Does it just free up the GPU's memory?
user14889957
2
votes
0 answers

GPU is used by Xwayland in Docker image

I'm currently trying to use a Docker image to train a generative adversarial network. Unfortunately, when I try to run the script, I get the following error: [2023-07-29 11:02:47 @__init__.py:80] Saving logging to file:…
1
vote
1 answer

nvidia-smi vs torch.cuda.memory_allocated

I am checking GPU memory usage during the training step. My main question: the GPU memory reported by the torch.cuda.memory_allocated method differs from what nvidia-smi reports, and I want to know why. Actually, I measured…
core_not_dumped
  • 759
  • 2
  • 22
1
vote
1 answer

Read GPU Information from Console C++

I want to create my own overclocking monitor, for which I need to read information such as the current voltage, clock speeds, and so on. In C++ I can easily get the information from nvidia-smi by typing, for example: console("nvidia-smi -q -i…
JackDerke
  • 11
  • 2
1
vote
2 answers

Nvidia driver is not recognized properly

OS: Ubuntu 20.04 LTS / Windows 10 dual boot. Error with the nvidia-smi command after installing the NVIDIA driver via apt. $ nvidia-smi Unable to determine the device handle for GPU 0000:0B:00.0: Not Found $ dmesg |grep NVRM [ 3.065144] NVRM: loading NVIDIA…
chess0000
  • 31
  • 1
  • 3
1
vote
0 answers

Why different GPUs use different amounts of memory?

I have two GPUs on different computers. One (NVIDIA A100) is on a server, the other (NVIDIA Quadro RTX 3000) is on my laptop. I monitor performance on both machines via nvidia-smi and noticed that the two GPUs use different amounts of memory when…
tnknepp
  • 5,888
  • 6
  • 43
  • 57
1
vote
0 answers

is there way to know which container is using which gpu device?

Let's say I have Docker containers A, B, and C running, and GPUs 1, 2, and 3. I can check the GPU process ID with nvidia-smi, but sometimes a container holds on to GPU memory after it has finished using it. So I want to find which container is using which GPU and…
jakeE
  • 11
  • 2
1
vote
0 answers

Technique to measure GPU utilization over a given period of time

We run an HPC cluster with GPUs. We would like to report the overall GPU utilization for the job. I know I can do it by periodically sampling in the background and doing the math. I was wondering if there was a tool where I could basically start…
William Allcock
  • 134
  • 2
  • 9
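
The "sample periodically and do the math" approach mentioned in this question could look roughly like the Python sketch below. It assumes nvidia-smi is on the PATH and uses its `--query-gpu=utilization.gpu --format=csv,noheader,nounits` query, which prints one percentage per GPU per line. The function names (`parse_sample`, `mean_utilization`, `sample_for`) are hypothetical, and this is a plain average over all GPUs, not a scheduler-integrated tool.

```python
import subprocess
import time

# nvidia-smi query that prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_sample(output: str) -> list[int]:
    """Turn one query output into per-GPU percentages, skipping non-numeric lines."""
    stripped = (line.strip() for line in output.splitlines())
    return [int(value) for value in stripped if value.isdigit()]


def mean_utilization(samples: list[list[int]]) -> float:
    """Average over every sample and every GPU; 0.0 when there is no data."""
    values = [v for sample in samples for v in sample]
    return sum(values) / len(values) if values else 0.0


def sample_for(duration_s: float, interval_s: float = 1.0) -> float:
    """Poll nvidia-smi for duration_s seconds and return the mean utilization."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        samples.append(parse_sample(result.stdout))
        time.sleep(interval_s)
    return mean_utilization(samples)
```

Calling `sample_for(60)` in a background process for the lifetime of a job would give a coarse per-job average; a shorter interval trades overhead for resolution.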
1
vote
1 answer

watch command not working with special characters and quotes

watch -n 1 "paste <(ssh ai02 'nvidia-smi pmon -s um -c 1') <(ssh ai03 'nvidia-smi pmon -s um -c 1' )" The above command is used to horizontally stack GPU stats from two servers. It works without the watch command, but with watch I get the following error: sh:…
JimmyJ
  • 41
  • 9
1
vote
1 answer

Most simplified form of the following regex / Extracting all values from nvidia-smi output

I am trying to analyze a very large text string in Python containing nvidia-smi outputs, but I really want to spend more time analyzing the data than working on my regex skills. I came up with the following regex, but it takes forever on some rows (it might be…
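
As a minimal Python sketch of one way to keep such extraction fast, a small pattern per field avoids the nested quantifiers that tend to cause long backtracking. The asker's regex is not shown in the excerpt, so this is not a simplification of it; the pattern, the function name `extract_memory`, and the sample row are all illustrative assumptions based on the default nvidia-smi table layout.

```python
import re

# Matches the "used / total" memory column, e.g. "1234MiB / 11178MiB".
# Fixed tokens and no nested quantifiers leave little room for backtracking.
MEMORY_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def extract_memory(text: str) -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) for every memory column found in text."""
    return [(int(used), int(total)) for used, total in MEMORY_RE.findall(text)]


if __name__ == "__main__":
    # Illustrative row in the style of the default table (not from the question).
    row = "| 30%   45C    P2    75W / 250W |   1234MiB / 11178MiB |     45%      Default |"
    print(extract_memory(row))
```

Other columns (temperature, power, utilization) could each get their own short pattern in the same style.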
0
votes
1 answer

StableLM answers too slow on GCP VM with GPU

I installed StableLM on a GCP VM with these specs: 1 x NVIDIA Tesla P4, 8 vCPU - 30 GB memory. And I set the model params llm_int8_enable_fp32_cpu_offload=True. But it takes too long to answer questions, ~8 minutes. It was faster even when using…
0
votes
0 answers

NVIDIA SMI shows lower CUDA version than NVCC

On an installation, when I run nvidia-smi, it shows the CUDA version as being 12.0. After installing the CUDA Toolkit, nvcc --version reports the version is 12.2. Is this a problem? Based on this very comprehensive answer, I understood that NVIDIA…
ahron
  • 803
  • 6
  • 29