I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies both to manual downscaling and to preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.

When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run the necessary mappers and then re-run the reduce stage. Spark's recovery here is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.

Note that you can lose nodes in one of three scenarios:
- Intentionally removing nodes (autoscaling or manual downscaling)
- Preemptible VMs getting preempted. GCE preempts preemptible VMs within 24 hours of startup.
- (Relatively rare) a standard GCE VM being ungracefully terminated by GCE and restarted. Usually, standard VMs are transparently live migrated instead.
When you create an autoscaling cluster, Dataproc adds several properties to improve job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set, but you can set them yourself when you create the cluster.
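As a rough sketch of what that looks like at creation time (the cluster name, region, and policy name below are placeholders, and depending on your gcloud version you may need the beta command group):

# Sketch: create an autoscaling cluster with the resiliency properties set
# explicitly. Replace my-cluster, my-policy, and us-central1 with your own.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --autoscaling-policy=my-policy \
    --properties=yarn:yarn.resourcemanager.am.max-attempts=10,mapred:mapreduce.map.maxattempts=10,mapred:mapreduce.reduce.maxattempts=10,spark:spark.task.maxFailures=10,spark:spark.stage.maxConsecutiveAttempts=10,spark:spark.yarn.am.attemptFailuresValidityInterval=1h,spark:spark.yarn.executor.failuresValidityInterval=1h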
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning; a graceful decommission timeout can be set on autoscaling policies or on manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
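To make that concrete, here is a rough sketch of both places the timeout can go; the policy, cluster, and region names are placeholders, and the YAML fields in the comment follow the autoscaling policy format as I understand it:

# Sketch: a graceful decommission timeout inside an autoscaling policy,
# imported from a local YAML file containing something like:
#
#   basicAlgorithm:
#     yarnConfig:
#       gracefulDecommissionTimeout: 1h
#       ...
#
gcloud dataproc autoscaling-policies import my-policy \
    --region=us-central1 \
    --source=my-policy.yaml

# Sketch: graceful decommissioning on a manual downscale operation.
gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-secondary-workers=0 \
    --graceful-decommission-timeout=1h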
Note that graceful decommissioning only really makes sense on long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
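As a sketch of the idle-only approach (field values below are placeholders, and I'm assuming the standard autoscaling policy YAML format), the policy would look something like this:

# Sketch: a policy that only scales down when every removable worker is
# idle, i.e. the whole cluster is idle. Values are placeholders.
cat > idle-only-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    scaleDownMinWorkerFraction: 1.0
EOF

gcloud dataproc autoscaling-policies import idle-only-policy \
    --region=us-central1 \
    --source=idle-only-policy.yaml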
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and on large clusters those preemptions are spread out over time rather than happening all at once.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw- (the substring that appears in secondary worker hostnames), you are likely seeing fetch failures due to nodes being preempted.
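A quick way to check (a sketch; it assumes you have saved the Spark driver output to a local file named driver.log) is to look for fetch failures that mention secondary workers:

# Sketch: count fetch failures that reference secondary worker hostnames,
# which contain "-sw-". driver.log is a placeholder for your driver output.
grep FetchFailedException driver.log | grep -c -- -sw-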
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, set secondaryWorkerConfig to 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers standard (non-preemptible) VMs.
Rough sketches of both options are below.
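These are sketches under the same assumptions as above (placeholder names and the standard autoscaling policy YAML format):

# Option 1 (sketch): autoscale only the primary group and pin secondary
# workers to zero in the policy YAML.
cat > primary-only-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 100
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 0
basicAlgorithm:
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Option 2 (sketch): keep secondary workers, but make them standard VMs.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --autoscaling-policy=my-policy \
    --properties=dataproc:secondary-workers.is-preemptible.override=false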
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode (EFM) is the long-term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.