Background
Recently my lab invested in GPU computation infrastructure. More specifically: two Titan V cards installed in a standard server machine. Currently the machine runs an essentially unconfigured Windows Server. Everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others because someone accidentally occupied all available GPU memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
- Multi-user. PhDs and students should be able to run their tasks.
- Job queue or scheduling (preferably something like time-sliced scheduling)
- Dynamic allocation of resources. If a single task is running, it is OK for it to use all of the memory, but as soon as a second one is started they should share the resources.
- Easy / remote job submission: maybe a web page?
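One note on the memory-sharing requirement: the "someone accidentally occupied all available memory" problem is often not a scheduling issue at all, but a framework default. TensorFlow, for example, reserves the entire GPU memory per process unless told otherwise. A minimal sketch, assuming TensorFlow 2.x, that makes each process allocate memory incrementally instead:

```python
import tensorflow as tf

# By default, TensorFlow reserves all memory on every visible GPU.
# Enabling memory growth makes each process allocate only what it
# actually uses, so two users' jobs can coexist on one card.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Asking every user to set this (or the equivalent option in their framework of choice) already removes the worst of the contention, independent of which scheduler you pick.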
What I tried so far
I have a small test setup (consumer PC with GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all, I like the idea of a cluster management system, since it offers the option to extend the infrastructure in the future.
SLURM was fairly easy to set up, but I was not able to configure anything like remote submission or time-sliced scheduling.
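For context on what I tried with SLURM: GPU jobs are normally submitted through a batch script like the sketch below. The GRES count assumes GPUs are declared in `gres.conf`, and the job name, resource numbers, and `train.py` are placeholders, not my actual setup:

```
#!/bin/bash
#SBATCH --job-name=train-model     # arbitrary job name
#SBATCH --gres=gpu:1               # request one GPU (needs gres.conf entries)
#SBATCH --mem=16G                  # host RAM for the job
#SBATCH --time=04:00:00            # wall-clock limit

python train.py                    # placeholder for the actual workload
```

As far as I understand, "remote submission" in the SLURM world is usually just `ssh gpu-server sbatch job.sbatch` from any machine with SSH access, rather than a web page.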
In the meantime I also tried to work with Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand. And again, I was not able to build something like remote submission.
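For comparison, this is roughly what a GPU job looks like in Kubernetes, assuming the NVIDIA device plugin is installed; the image name and command are placeholders:

```
# Sketch of a Kubernetes Job requesting one GPU via the NVIDIA device plugin.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:latest   # placeholder training image
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1                  # GPUs are requested as whole devices
```

Submitting with `kubectl apply -f job.yaml` from any machine with cluster credentials is effectively remote submission. One caveat I noticed: Kubernetes hands out whole GPUs per container, so the dynamic memory sharing from my requirements list is not something it provides natively.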
My question
Has someone faced the same problem and can report their solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks, Tim