
I am trying to run a distributed computation using Dask on an AWS Fargate cluster (via the dask.cloudprovider API) and am running into the exact same issue as this question. Based on the partial answers to the linked question, and on reports like this one, I strongly suspect the pandas version on my workers is outdated; indeed, the official Dask Dockerfile pins an old-ish version of pandas.

By contrast, when I run my computation locally (using a distributed.LocalCluster) with pandas 1.2.2, it works fine. For reference, it is a call to the categorize method on a Dask DataFrame that triggers the error on the Fargate cluster.

As a workaround, I would simply like to specify the pandas version in the image deployed to the workers myself, either by building a custom image, pushing it to an image repository, and having the workers use it, or through some other method. Is there a way to achieve this?

1 Answer


One option that might work is to pass environment variables such as EXTRA_CONDA_PACKAGES and EXTRA_PIP_PACKAGES to indicate the package versions you would like installed. This appears to be supported by dask.cloudprovider, as seen here, and is also noted in the dask-docker repo you linked. The variables are passed as a dict via the environment parameter.
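As a sketch of that first option (the pandas version pin here is illustrative, and the cluster creation itself requires AWS credentials, so it is commented out):

```python
# Environment variables that the default Dask image's entrypoint reads
# to install extra packages at container startup.
env = {
    "EXTRA_PIP_PACKAGES": "pandas==1.2.2",
}

# from dask_cloudprovider.aws import FargateCluster
# cluster = FargateCluster(environment=env)  # needs AWS credentials/permissions
```

Note that installing packages at startup adds latency every time a worker boots, which is one reason you might prefer the custom-image route below for anything long-lived.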

Another option would be to build and push your own image, as you mentioned, which also appears to be supported by dask.cloudprovider, as indicated here. The image tag is passed to the cluster constructor via the image parameter.
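A sketch of the second option, assuming you have already built an image from the official Dask Dockerfile with pandas pinned and pushed it to a registry such as ECR (the URI below is entirely hypothetical):

```python
# Hypothetical URI of a custom image pushed to ECR; substitute your own
# account ID, region, repository name, and tag.
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-dask:latest"

# from dask_cloudprovider.aws import FargateCluster
# cluster = FargateCluster(image=image_uri)  # needs AWS credentials/permissions
```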

The options linked are for ECSCluster, from which FargateCluster inherits, as seen here.

Brian Larsen