
We have a shared server with multiple GPU nodes and no resource manager. We make agreements of the kind: "this week you can use nodes ID1, ID2 and ID5". I have a program that takes this ID as a parameter.

When I need to run my program ten times with ten different sets of parameters $ARGS1, $ARGS2, ..., $ARGS10, I first run three commands

programOnGPU $ARGS1 -p ID1 &
programOnGPU $ARGS2 -p ID2 &
programOnGPU $ARGS3 -p ID5 &

Then I must wait for any of them to finish, and if e.g. ID2 finishes first, I then run

programOnGPU $ARGS4 -p ID2 &

As this is not very convenient when you have a lot of processes, I would like to automate the process. I cannot use parallel as I need to reuse the IDs.

The first use case is a script that needs to execute 10 a priori known commands of the type

programOnGPU $PARAMS -p IDX

and, when any of them finishes, assign its ID to the next one in the queue. Is this possible using bash without too much overhead of the SLURM kind? I don't need to check the state of the physical resource.

A general solution would be a queue in bash, or a simple command-line utility, to which I can submit commands of the type

programABC $PARAMS

and it will add the GPU ID parameter to them and manage the queue, preconfigured to use only the given IDs, with each ID used by one job at a time. Again, I don't want this layer to touch the physical GPUs, only to ensure that it executes consistently over the allowed IDs.

VojtaK
  • I don't understand your examples/needs very well, but if you want a simple way to reserve/assign resources amongst a load of people across a network, you can make something very low overhead, reliable, fast and responsive with **Redis** - see https://stackoverflow.com/a/39074276/2836621 and https://stackoverflow.com/a/22220082/2836621 – Mark Setchell Dec 15 '20 at 17:33
  • @MarkSetchell I have tried to improve the question. In my case the problem is that the jobs need to know a free ID before being executed. And when the process using a given ID finishes, the management tool must figure out that the ID is free and pass it to another process. – VojtaK Dec 15 '20 at 17:58
  • 1
    Why do you think you can't use `parallel` for this? Just make it parallelize across the IDx array and read from a shared input. – tripleee Dec 15 '20 at 18:11
  • As tripleee mentions, if you've got `parallel` installed, look at using the `-j #` option to limit the number of concurrent jobs; if you don't have (or cannot get) `parallel` installed, you can use a `while` loop to kick off 'N' jobs in parallel, then use `wait -n` to wait for one job to complete before kicking off the next job in the background (e.g., [2nd half of this answer](https://stackoverflow.com/a/64697608)); if your `bash` does not support `wait -n`, you can look at [this example of polling jobs' output](https://stackoverflow.com/a/49743261) as a means of load balancing – markp-fuso Dec 15 '20 at 18:21
  • @VojtaK Wow, cool, so just do it - write such a system that will manage resources in the requested way. Why are you writing here? What is your question? If your question is `Is this possible using bash` - then yes, sure, it's possible. Just do it™ - learn bash, the tools, management and locking, and write such a utility. If you want someone else to write such a utility for you, I suggest a freelancing site. – KamilCuk Dec 15 '20 at 19:03
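
For reference, the `wait -n` idea from the comment above could be sketched roughly as follows. This is only a sketch: it assumes bash >= 4.3 (for `wait -n`), and the IDS and ARGS values are placeholders standing in for the question's node IDs and parameter sets.

#!/usr/bin/env bash
# Keep one job per allowed ID; when a job exits, hand its ID to the next argument set.
IDS=(ID1 ID2 ID5)                        # the IDs agreed for this week
ARGS=("$ARGS1" "$ARGS2" "$ARGS3")        # ...extend with all ten argument sets

declare -A id_of_pid                     # PID of a running job -> the ID it occupies
free_ids=("${IDS[@]}")                   # IDs not currently in use

for a in "${ARGS[@]}"; do
    # No free ID? Wait for any job to finish, then reclaim the IDs of exited jobs.
    while ((${#free_ids[@]} == 0)); do
        wait -n
        for pid in "${!id_of_pid[@]}"; do
            if ! kill -0 "$pid" 2>/dev/null; then
                free_ids+=("${id_of_pid[$pid]}")
                unset "id_of_pid[$pid]"
            fi
        done
    done
    id=${free_ids[0]}
    free_ids=("${free_ids[@]:1}")
    programOnGPU $a -p "$id" &           # $a unquoted on purpose: it is a whole parameter set
    id_of_pid[$!]=$id
done
wait                                     # wait for the remaining jobs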

2 Answers

1

This is very simple with Redis. It is a very small, very fast, networked, in-memory data-structure server. It can store sets, queues, hashes, strings, lists, atomic integers and so on.

You can access it across a network in a lab, or across the world. There are clients for bash, C/C++, Ruby, PHP, Python and so on.

So, if you are allocated nodes 1, 2 and 5 for the week, you can just store those in a Redis "list" with LPUSH using the Redis "Command Line Interface" for bash:

redis-cli lpush VojtaKsNodes 1 2 5

If you are not on the Redis host, add its hostname/IP-address into the command like this:

redis-cli -h 192.168.0.4 lpush VojtaKsNodes 1 2 5

Now, when you want to run a job, get a node with BRPOP. I specify an infinite timeout with the zero at the end, but you could wait a different amount of time:

# Get a node, with infinite timeout.
# Note: brpop prints the list name and then the value, so keep only the last line.
node=$(redis-cli brpop VojtaKsNodes 0 | tail -1)

# run your job on that node, e.g.
programOnGPU $ARGS -p "$node"

# Give node back
redis-cli lpush VojtaKsNodes "$node"
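
Putting it together for the first use case (ten known argument sets, three allowed IDs), a minimal sketch might look like this; the ARGS array below is a placeholder for the question's ten parameter sets:

# One-off: seed the list of free node IDs agreed for the week
redis-cli del VojtaKsNodes
redis-cli lpush VojtaKsNodes ID1 ID2 ID5

# Launch all jobs; each one blocks until some node ID is free
ARGS=("$ARGS1" "$ARGS2" "$ARGS3")     # ...extend with all ten argument sets
for a in "${ARGS[@]}"; do
    (
        node=$(redis-cli brpop VojtaKsNodes 0 | tail -1)
        programOnGPU $a -p "$node"    # $a unquoted on purpose: it is a whole parameter set
        redis-cli lpush VojtaKsNodes "$node"
    ) &
done
wait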
Mark Setchell
  • Thank you! This solution is so general that we decided to implement it for managing the resources for multiple users. Still, there are some drawbacks, such as brpop prioritizing by the time of the call, so there is no way to override the priority of the waiting calls. And second, for "normal users" it is quite an overhead, so I ended up writing giveMeGPU and returnGPU scripts based on this. I also found a way to associate the user name with the GPU assigned to them through hashes. – VojtaK Dec 16 '20 at 23:18
  • Cool - glad it got you started. If you have developed a better, more complete solution, feel free to write it up as an answer and accept it - I'm not at all worried about losing the points. Good luck with your project! – Mark Setchell Dec 17 '20 at 07:13
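
The giveMeGPU/returnGPU wrappers mentioned in the comment above are not shown in the thread; purely as an illustration, they might look roughly like this, using a Redis hash (the VojtaKsUsers key name is made up here) to remember which user holds which ID:

# giveMeGPU (sketch): take a free ID and record who holds it
node=$(redis-cli brpop VojtaKsNodes 0 | tail -1)
redis-cli hset VojtaKsUsers "$USER" "$node"
echo "$node"

# returnGPU (sketch): look up this user's ID, return it to the list, clear the record
node=$(redis-cli hget VojtaKsUsers "$USER")
redis-cli lpush VojtaKsNodes "$node"
redis-cli hdel VojtaKsUsers "$USER"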
0

I would:

  • Keep a list of the allowed IDs: IDS=(ID1 ID2 ID5).
  • Make 3 files, one containing each ID.
  • Run xargs -L1 -P3 programOnGPUFromLockedFile <arguments so that 3 processes run at a time, one per line of arguments (a sketch follows after this list).
    • Each of the processes non-blockingly tries to flock each of the 3 files in a loop, endlessly (i.e. you can run more than 3 processes if you want).
    • When one succeeds in taking the flock, it:
      • reads the ID from the file,
      • runs the action on that ID,
      • and when it terminates it releases the flock, so the next process may lock the file and use the ID.

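A minimal sketch of what programOnGPUFromLockedFile could look like (the /tmp/gpu-* lock-file paths are an assumption, not part of the answer; each file is expected to contain its ID on the first line):

#!/usr/bin/env bash
# programOnGPUFromLockedFile (sketch): find a free ID by locking one of the
# ID files, then run the real program on it with the caller's arguments.

lockfiles=(/tmp/gpu-ID1 /tmp/gpu-ID2 /tmp/gpu-ID5)

while :; do
    for f in "${lockfiles[@]}"; do
        exec {fd}<"$f"                  # open the lock file on a fresh descriptor
        if flock -n "$fd"; then         # try to take the lock without blocking
            read -r id <&"$fd"          # the file holds the ID
            programOnGPU "$@" -p "$id"  # run the real job
            exit $?                     # the lock is released when the descriptor closes on exit
        fi
        exec {fd}<&-                    # not acquired: close and try the next file
    done
    sleep 1                             # nothing free yet; retry
done

xargs would then invoke it roughly as xargs -L1 -P3 ./programOnGPUFromLockedFile <arguments, with one parameter set per line of the arguments file.
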
I.e. it's very basic mutex locking. There are also other ways you can do it, like with an atomic FIFO (sketched after the list below):

  • Create a FIFO.
  • Spawn one process for each argument set you want to run; each process will:
    • read one line from the FIFO,
    • take that line as the ID to run on,
    • do the job on that ID,
    • and write one line with the ID back to the FIFO.
  • Then write one ID per line to the FIFO (in 3 separate writes, so that each write is hopefully atomic), so 3 processes may start.
  • Wait until all except 3 child processes have exited.
  • Read 3 lines from the FIFO (draining the IDs the last jobs wrote back).
  • Wait until all child processes exit.
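
A rough sketch of that FIFO variant (the FIFO path and the arguments file, one parameter set per line, are illustrative; keeping the FIFO open read-write in the parent avoids blocking on open and makes the final read-back step unnecessary):

#!/usr/bin/env bash
# FIFO variant (sketch): each job takes an ID line from the FIFO and writes it back when done.

fifo=/tmp/gpu-ids.fifo
mkfifo "$fifo"
exec 3<>"$fifo"                  # keep the FIFO open read-write in the parent

while IFS= read -r argline; do
    (
        read -r id <&3                   # take a free ID (only "hopefully" atomic, as noted above)
        programOnGPU $argline -p "$id"   # unquoted on purpose: one line = one parameter set
        echo "$id" >&3                   # give the ID back
    ) &
done < arguments

for id in ID1 ID2 ID5; do        # hand out the allowed IDs, one write per line
    echo "$id" >&3
done

wait                             # all jobs have finished
exec 3>&-
rm "$fifo"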
KamilCuk
  • Thank you for your answer. I think that the question I have asked is pretty general, so I expected that there is a tool for that. I am looking for approaches to do this, so it's up to the users here whether they are willing to share their answers or not. – VojtaK Dec 16 '20 at 23:21