
We are currently using Dask Gateway with CPU-only workers. However, down the road, as deep learning becomes more widely adopted, we want to add GPU support to the clusters created through Dask Gateway.

I've checked the Dask Gateway documentation, and there isn't much in the way of detailed instructions on how to set this up or which parts of the Helm chart/config we need to change to enable this functionality.

What I'm thinking is to first add a GPU to the GKE cluster on GCP, then use a RAPIDS Dockerfile for the Dask workers so they can utilize that GPU. Is that all the set-up needed for Dask Gateway?

Would appreciate if someone could point me in the right direction.

Riley Hun
  • I've added an answer for the main part of your question. I recommend you open a separate question around needing GPUs for SciKeras, Skorch, etc. – Jacob Tomlinson Dec 09 '20 at 10:08
  • Thanks a million @JacobTomlinson. I will definitely do that. My last question would be for our organization, do you recommend we have 2 Gateway instances: one with CPU compute and the other for GPU compute? Or can the gateway instance with GPU compute potentially replace the one with CPU compute? – Riley Hun Dec 09 '20 at 21:52
  • Honestly, I'm not sure. It would be good if Gateway had the concept of profiles, so you could choose different cluster configurations (perhaps you could raise a GitHub issue to propose this). I think today the easiest course of action is to have multiple gateways. – Jacob Tomlinson Dec 10 '20 at 15:55

1 Answer


To run a Dask cluster on Kubernetes capable of GPU compute you need the following:

  • Kubernetes nodes need GPUs and drivers. This can be set up with the NVIDIA k8s device plugin (for a GKE-specific sketch, see below this list).
  • Scheduler and worker pods will need a Docker image with NVIDIA tools installed. As you suggest, the RAPIDS images are good for this.
  • The pod container spec will need GPU resources such as resources.limits.nvidia.com/gpu: 1
  • The Dask workers need to be started with the dask-cuda-worker command from the dask_cuda package (which is included in the RAPIDS images).
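
Since you mention GKE on GCP, here is a rough sketch of that first step. This part is my assumption rather than something from the original answer: the node pool name, cluster name, zone, machine type and accelerator type are all placeholders, and on GKE the NVIDIA drivers are installed with Google's driver-installer DaemonSet (GKE then runs its own device plugin on GPU nodes).

# Add a GPU node pool to an existing GKE cluster (names/types/zone are placeholders)
$ gcloud container node-pools create gpu-pool \
    --cluster my-dask-cluster \
    --zone us-central1-a \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --num-nodes 1

# Install the NVIDIA drivers on those nodes via Google's driver-installer DaemonSet
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml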

Note: For Dask Gateway, your container image also needs the dask-gateway package installed. We can configure this to be installed at runtime, but it's probably best to create a custom image with the package already installed.
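
If you take the custom-image route, a minimal Dockerfile could look something like the sketch below. This is an assumption on my part rather than part of the original answer: in particular, the path to pip depends on how the RAPIDS base image lays out its conda environment, so adjust it to match the image you actually use. With such an image you could drop the EXTRA_PIP_PACKAGES environment variables from the config below.

# Dockerfile (sketch): RAPIDS base image with dask-gateway pre-installed
FROM rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8

# Install dask-gateway into the image's RAPIDS conda environment
# (environment path assumed; check your base image)
RUN /opt/conda/envs/rapids/bin/pip install dask-gateway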

Therefore, here is a minimal Dask Gateway config that will get you a GPU cluster.

# config.yaml
gateway:
  backend:
    image:
      name: rapidsai/rapidsai
      tag: cuda11.0-runtime-ubuntu18.04-py3.8  # Be sure to match your k8s CUDA version and user's Python version

    worker:
      extraContainerConfig:
        env:
          - name: EXTRA_PIP_PACKAGES
            value: "dask-gateway"
        resources:
          limits:
            nvidia.com/gpu: 1  # This could be >1, you will get one worker process in the pod per GPU

    scheduler:
      extraContainerConfig:
        env:
          - name: EXTRA_PIP_PACKAGES
            value: "dask-gateway"
        resources:
          limits:
            nvidia.com/gpu: 1  # The scheduler requires a GPU in case of accidental deserialisation

  extraConfig:
    cudaworker: |
      c.ClusterConfig.worker_cmd = "dask-cuda-worker"

We can test that things work by launching Dask Gateway, creating a Dask cluster, and running some GPU-specific work. Here is an example where we get the NVIDIA driver version from each worker.

$ helm install dgwtest daskgateway/dask-gateway -f config.yaml

In [1]: from dask_gateway import Gateway

In [2]: gateway = Gateway("http://dask-gateway-service")

In [3]: cluster = gateway.new_cluster()

In [4]: cluster.scale(1)

In [5]: from dask.distributed import Client

In [6]: client = Client(cluster)

In [7]: def get_nvidia_driver_version():
   ...:     import pynvml
   ...:     return pynvml.nvmlSystemGetDriverVersion()
   ...: 

In [9]: client.run(get_nvidia_driver_version)
Out[9]: {'tls://10.42.0.225:44899': b'450.80.02'}
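
As a further sanity check (my addition, not from the original answer), you can run a small CUDA computation on the workers with CuPy, which ships in the RAPIDS image, using the same client.run pattern as above; client.run returns a dict mapping each worker address to the function's result.

In [10]: def gpu_sum_of_squares():
    ...:     import cupy  # available in the RAPIDS image
    ...:     x = cupy.arange(1_000_000)       # array allocated on the GPU
    ...:     return float((x ** 2).sum())     # reduce on the GPU, return a plain float
    ...: 

In [11]: client.run(gpu_sum_of_squares)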
Jacob Tomlinson