Worker time-out with Dask Kubernetes on GKE

Question

I'm trying to get dask-kubernetes to work with my GKE account. The maddening thing is that it used to work, and now it doesn't. The cluster sets up fine, and the nodes get created fine as well. They run for 60 seconds and then time out with the following message (as shown by kubectl logs podname):

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 8, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 446, in go
    main()
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 432, in main
    loop.run_sync(run)
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 284, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

I assume this means that the workers can't connect to the scheduler, which runs on my laptop. However, I don't understand why, since the port seems to be open.

from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da

if __name__ == '__main__':
    cluster = KubeCluster.from_yaml('worker-spec-2.yml')
    cluster.scale(1)
    client = Client(cluster)
    array = da.ones((1000, 1000, 1000))  
    print(array.mean().compute())

And the worker-spec-2.yml contains the following:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    name: easyvvuq
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: 500m
        memory: 2G

Again, this or something similar has worked for me before. I may have changed something in the worker-spec.yml, but that is about it.

My question is: how do I go about diagnosing this? I am not a Kubernetes expert by any means.
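
For what it's worth, the kind of check I have in mind is a plain TCP reachability test run from inside a worker pod against the scheduler's address. This is only a minimal sketch using the Python standard library, not anything taken from dask-kubernetes; the helper name can_connect is made up for illustration, and the host and port have to be replaced with whatever scheduler address the worker is actually given.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; it only tests reachability and
    says nothing about whether a Dask scheduler is listening on that port.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling can_connect(scheduler_host, scheduler_port) from a Python shell inside the pod and getting False would suggest that the pod cannot reach the address at all, which would point to a networking problem rather than something in the worker spec.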

Did you deploy Dask using Helm as recommended in the documentation? https://docs.dask.org/en/latest/setup/kubernetes-helm.html - Will R.O.F., Jun 29 '20 at 16:01
No. I want to use a custom Docker image based on the Dask image, and I will need to ask users to create images based on that image. I think using Helm would only make it more complicated. - orbitfold, Jun 30 '20 at 08:13
