I am training a transformers model on several GPUs (3 of the 8 available) and want to stop the training on specific GPUs only (0, 6, and 7). I tried the top command, but it shows only PIDs, and I don't know which GPU each PID belongs to. I do not want to run kill -9 blindly, because I can't tell which GPU's job it would stop: I want to stop the processes on GPUs 0, 6, and 7 and keep the others running.

I reproduced the problem with a small example:

from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop():
    # Runs once in each launched process.
    set_seed(42)
    accelerator = Accelerator(mixed_precision="fp16")
    print("Hello There!")
    # main()

# Pass the function itself, not its return value, so the launcher can call it in each process.
notebook_launcher(training_loop, num_processes=2)

Launching the script from the terminal:

CUDA_VISIBLE_DEVICES=0,6,7 python3 AccelerateTrainer.py

Once those processes are stopped, I expect nvidia-smi to show 0% for GPUs 0, 6, and 7.

Mohammed

2 Answers

Below the GPU table, nvidia-smi prints a list of processes that shows the GPU index and PID of each one. If you are inside a Docker container, you need to exit it to see the PIDs. From that list you can identify which process is using memory on which GPU and kill it.
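The same lookup can be scripted. Here is a rough Python sketch that asks nvidia-smi for its process list as CSV and groups the PIDs by GPU index. The helper names (`run_nvidia_smi`, `parse_gpu_index_by_uuid`, `parse_pids_by_gpu`, `live_pids_by_gpu`) are made up for illustration, and it assumes nvidia-smi is on the PATH of the host where the training runs.

```python
import subprocess
from collections import defaultdict

def run_nvidia_smi(args):
    """Run nvidia-smi with the given arguments and return its stdout."""
    result = subprocess.run(["nvidia-smi", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

def parse_gpu_index_by_uuid(csv_text):
    """Parse 'index, uuid' lines into {uuid: index}."""
    mapping = {}
    for line in csv_text.strip().splitlines():
        index, uuid = [field.strip() for field in line.split(",")]
        mapping[uuid] = int(index)
    return mapping

def parse_pids_by_gpu(csv_text, index_by_uuid):
    """Parse 'pid, gpu_uuid' lines into {gpu_index: [pid, ...]}."""
    pids = defaultdict(list)
    for line in csv_text.strip().splitlines():
        pid, uuid = [field.strip() for field in line.split(",")]
        pids[index_by_uuid[uuid]].append(int(pid))
    return dict(pids)

def live_pids_by_gpu():
    """Query the local driver; only works on a machine with nvidia-smi."""
    gpus = run_nvidia_smi(["--query-gpu=index,uuid", "--format=csv,noheader"])
    apps = run_nvidia_smi(["--query-compute-apps=pid,gpu_uuid",
                           "--format=csv,noheader"])
    return parse_pids_by_gpu(apps, parse_gpu_index_by_uuid(gpus))
```

Calling `live_pids_by_gpu()` on the GPU host should return a dictionary keyed by GPU index, so the PIDs under keys 0, 6, and 7 are the ones to stop.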

Ivan

I found this Linux command, which lists the running processes together with their start times:

ps -eo pid,lstart,cmd -u user_name | grep -i python3

Then, once I know which script is running on which GPUs, I stop the right process with:

kill -9 <process_id>
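A small Python sketch of that last step, assuming you already have a mapping from GPU index to PIDs (for example, read off the nvidia-smi process list). The function names and the PIDs in the usage comment are hypothetical. By default it sends SIGTERM rather than SIGKILL (kill -9) so the training script gets a chance to shut down cleanly; that is a swapped-in choice, not what the command above does.

```python
import os
import signal

def pids_on_gpus(pids_by_gpu, target_gpus):
    """Return the sorted, de-duplicated PIDs that appear on any target GPU."""
    selected = set()
    for gpu in target_gpus:
        selected.update(pids_by_gpu.get(gpu, []))
    return sorted(selected)

def stop_processes(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID; return the PIDs that could not be signalled."""
    failed = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            # Process already exited, or it belongs to another user.
            failed.append(pid)
    return failed

# Hypothetical usage: stop everything on GPUs 0, 6 and 7, leave the rest alone.
# mapping = {0: [4101], 3: [5200], 6: [4102], 7: [4103]}
# stop_processes(pids_on_gpus(mapping, [0, 6, 7]))
```

Note that a single process can hold memory on several GPUs, so a PID listed under GPU 0 and GPU 3 would stop work on both; check the mapping before signalling.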

Mohammed