I'm trying to get dask-kubernetes to work with my GKE account. The maddening thing is that it worked before, but now it doesn't. I can set up a cluster fine, and the nodes get created fine as well. They run for 60 seconds and then time out with the following message (as shown by kubectl logs podname):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 8, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 446, in go
    main()
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 432, in main
    loop.run_sync(run)
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 284, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
I assume this means that the workers can't connect to the scheduler, which runs on my laptop. However, I don't understand why; the port seems to be open.
from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da
if __name__ == '__main__':
    cluster = KubeCluster.from_yaml('worker-spec-2.yml')
    cluster.scale(1)
    client = Client(cluster)
    array = da.ones((1000, 1000, 1000))
    print(array.mean().compute())
And the worker-spec-2.yml contains the following:
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    name: easyvvuq
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: 500m
        memory: 2G
Again, this or something similar has worked for me before. I may have changed something in the worker-spec.yml, but that is about it.
My question is: how do I go about diagnosing this? I am not a Kubernetes expert by any means.
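To make the question more concrete, this is the kind of check I imagine running from inside a worker pod to test whether the scheduler address is reachable at all. It is only a minimal sketch using the standard library: the helper name can_connect is my own, and the host and port in the example are placeholders (assumptions), not the real address of my scheduler.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; it only tests reachability of the
    port and says nothing about whether a Dask scheduler is listening there.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with the address the
    # worker is actually told to contact.
    scheduler_host = "127.0.0.1"
    scheduler_port = 8786
    print(can_connect(scheduler_host, scheduler_port, timeout=2.0))
```

If this returned False from inside the pod but True on my laptop, that would at least suggest a network reachability problem rather than something in the worker spec itself.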