I am trying to run a distributed computation using Dask on an AWS Fargate cluster (via the dask_cloudprovider API), and I am running into the exact same issue as this question. Based on the partial answers to the linked question, and on things like this, I strongly suspect it is due to the pandas version on my workers being outdated; and indeed the official Dask Dockerfile specifies an old-ish version of pandas.
By contrast, when I run my computation locally (using a distributed.LocalCluster) with pandas at version 1.2.2, it works fine. By the way, it is a call to the categorize method on a Dask DataFrame that triggers the error in the Fargate cluster case.
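For reference, the failing pattern is essentially the following; the file path and column name are placeholders, not my actual data:

```python
import dask.dataframe as dd

# Placeholder data source and column; the real computation is analogous.
ddf = dd.read_csv("s3://my-bucket/data-*.csv")

# Works on a LocalCluster with pandas 1.2.2,
# but raises the error on the Fargate cluster.
ddf = ddf.categorize(columns=["some_column"])
```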
What I would like to do as a workaround is simply to specify the pandas version myself in the image deployed to the workers, either by building a custom image, pushing it to an image repository, and having the workers use it, or through some other method. Is there a way to achieve this?
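To make the question concrete, here is a sketch of what I have in mind. The image URI is entirely hypothetical; I believe FargateCluster accepts an image argument, and I have also seen the EXTRA_PIP_PACKAGES environment variable mentioned as a way to install extra packages at container startup, but I have not verified either of these on Fargate:

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster

# Option 1: point the workers at a custom image (hypothetical ECR URI)
# built from the official Dask image with pandas pinned, e.g.
#   FROM daskdev/dask:latest
#   RUN pip install --no-cache-dir pandas==1.2.2
cluster = FargateCluster(
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-dask:pandas-1.2.2",
)

# Option 2: keep the official image but have it pip-install a newer
# pandas at startup via the EXTRA_PIP_PACKAGES environment variable.
# cluster = FargateCluster(
#     environment={"EXTRA_PIP_PACKAGES": "pandas==1.2.2"},
# )

client = Client(cluster)
```

If one of these is the intended approach, a pointer to the relevant documentation would be enough.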