We have shared server with multiple GPU nodes without resource manager. We make agreements that: "this week you can use nodes ID1,ID2 and ID5". I have a program that gets this ID as a parameter.
When I need to run my program ten times with ten different sets of parameters $ARGS1, $ARGS2, ..., $ARGS10
, I run first three commands
programOnGPU $ARGS1 -p ID1 &
programOnGPU $ARGS2 -p ID2 &
programOnGPU $ARGS3 -p ID5 &
Then I must wait for any of those to finish and if e.g ID2 finishes first, I then run
programOnGPU $ARGS4 -p ID2 &
As this is not very convenient when you have a lot of processes I would like to automatize the process. I can not use parallel
as I need to reuse IDs.
First use case is a script that needs to execute apriori known 10 commands of the type
programOnGPU $PARAMS -p IDX
when any of them finishes to assign its ID to another one in the queue. Is this possible using bash without too much overhead of the type of the SLURM? I don't need to check the state of physical resource.
General solution would be if I can make a queue in the bash or simple command line utility to which I will submit commands of the type
programABC $PARAMS
and it will add the GPU ID parameter to it and manage the queue that will be preconfigured to be able to use just given IDs and one ID at once. Again I don't want this layer to touch physical GPUs, but to ensure that it executes consistently over allowed ID's.