What causes a Dask job to fail with a CancelledError exception?

Question

I have been seeing the error message below for quite some time, but I could not figure out what leads to the failure.

Error:

concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)

After restarting the Dask cluster, the job runs successfully.

could you provide a [mcve](https://stackoverflow.com/help/minimal-reproducible-example) or some more context on your workflow to help diagnose the issue? Was @elukem 's answer below helpful? — rrpelgrim, Oct 04 '21 at 09:13

score 1 · Answer 1 · answered Jun 22 '21 at 07:57

If you are running a dask-cloudprovider ECSCluster or FargateCluster, concurrent.futures._base.CancelledError can result from a long-running step in the computation that produces no output (logging or otherwise) to the Client. Because nothing interacts with the client during that step, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError message is misleading, but the logs for the scheduler task itself record the idle timeout.

The solution is to set scheduler_timeout to a higher value, either via the Dask config or by passing it directly to the ECSCluster/FargateCluster constructor.
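A minimal sketch of both options, assuming a recent dask-cloudprovider release where the cluster classes live in dask_cloudprovider.aws (older releases exposed them from the top-level package). The "30 minutes" value and the two function names are illustrative choices, not part of the library; pick a timeout longer than your longest silent step.

```python
# Illustrative value; choose something longer than your longest quiet step.
SCHEDULER_TIMEOUT = "30 minutes"


def make_cluster_via_constructor(timeout=SCHEDULER_TIMEOUT):
    """Hypothetical helper: pass the timeout directly to the constructor."""
    # Imported lazily so this module loads without the AWS dependencies.
    from dask_cloudprovider.aws import FargateCluster

    # Creating the cluster needs valid AWS credentials and permissions.
    return FargateCluster(scheduler_timeout=timeout)


def make_cluster_via_config(timeout=SCHEDULER_TIMEOUT):
    """Hypothetical helper: set the config key before creating the cluster."""
    import dask
    from dask_cloudprovider.aws import FargateCluster

    # Config key named in the answer above; set it before construction.
    dask.config.set({"cloudprovider.ecs.scheduler_timeout": timeout})
    return FargateCluster()
```

Either helper returns a cluster that can be handed to dask.distributed.Client as usual. If the job still fails, checking the scheduler task's own logs should show whether the idle timeout was the cause.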
