Background
Recently my lab invested in GPU computation infrastructure. More specifically: two Titan V cards installed in a standard server machine. Currently the machine runs an essentially unconfigured Windows Server. Everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others because someone accidentally occupied all available GPU memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
- Multi-user. PhDs and students should be able to run their tasks.
- Job queue or scheduling (preferably something like time-sliced scheduling)
- Dynamic allocation of resources. If a single task is running, it is OK for it to use all of the memory, but as soon as a second one is started they should share the resources.
- Easy / remote job submission: maybe a web page?
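One note on the memory-sharing requirement: the "someone accidentally occupied all available memory" problem is often not a scheduling issue at all, but a framework default. TensorFlow, for example, reserves the entire GPU memory per process unless told otherwise. A minimal sketch, assuming TensorFlow 2.x, that makes each process allocate memory incrementally instead:

```python
import tensorflow as tf

# By default, TensorFlow reserves all memory on every visible GPU.
# Enabling memory growth makes each process allocate only what it
# actually uses, so two users' jobs can coexist on one card.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Asking every user to set this (or the equivalent option in their framework of choice) already removes the worst of the contention, independent of which scheduler you pick.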
What I tried so far
I have a small test setup (consumer PC with GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all, I like the idea of a cluster management system, since it offers the option to extend the infrastructure in the future.
SLURM was fairly easy to set up, but I was not able to configure anything like remote submission or time-sliced scheduling.
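For context on what I tried with SLURM: GPU jobs are normally submitted through a batch script like the sketch below. The GRES count assumes GPUs are declared in `gres.conf`, and the job name, resource numbers, and `train.py` are placeholders, not my actual setup:

```
#!/bin/bash
#SBATCH --job-name=train-model     # arbitrary job name
#SBATCH --gres=gpu:1               # request one GPU (needs gres.conf entries)
#SBATCH --mem=16G                  # host RAM for the job
#SBATCH --time=04:00:00            # wall-clock limit

python train.py                    # placeholder for the actual workload
```

As far as I understand, "remote submission" in the SLURM world is usually just `ssh gpu-server sbatch job.sbatch` from any machine with SSH access, rather than a web page.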
In the meantime I also tried to work with Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand. And again, I was not able to build something like remote submission.
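For comparison, this is roughly what a GPU job looks like in Kubernetes, assuming the NVIDIA device plugin is installed; the image name and command are placeholders:

```
# Sketch of a Kubernetes Job requesting one GPU via the NVIDIA device plugin.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:latest   # placeholder training image
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1                  # GPUs are requested as whole devices
```

Submitting with `kubectl apply -f job.yaml` from any machine with cluster credentials is effectively remote submission. One caveat I noticed: Kubernetes hands out whole GPUs per container, so the dynamic memory sharing from my requirements list is not something it provides natively.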
My question
Has someone faced the same problem and can report their solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks, Tim