
Apache Beam runners can perform fusion to optimize graph execution. However, in certain cases fusion may result in suboptimal performance. In Cloud Dataflow, this can be avoided by inserting a Reshuffle transform or a GroupByKey, which breaks fusion, as per Deploying-a-pipeline#fusion-optimization; otherwise, the fusion of several PTransforms results in a fused stage. Notably, Reshuffle is marked as deprecated, although the question (Apache Beam/Dataflow ReShuffle deprecated, what to use instead?) notes that Dataflow installs a ReshuffleOverrideFactory that essentially reduces the Reshuffle transform to a windowed GroupByKey followed by an iterable expansion.
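To make that decomposition concrete, here is a plain-Python sketch (my own illustration, not Beam or Dataflow code) of what the override factory does conceptually: pair each element with a random key, group by key, then expand the grouped iterables back into individual elements:

```python
import random
from collections import defaultdict

def reshuffle(elements):
    """Conceptual sketch of Reshuffle as reduced by ReshuffleOverrideFactory:
    random key -> GroupByKey -> expand iterables. The GroupByKey step is the
    materialization point that breaks fusion with upstream transforms."""
    # Step 1: tag each element with a random key to redistribute it
    keyed = [(random.randrange(8), e) for e in elements]

    # Step 2: GroupByKey -- in Dataflow this shuffles/materializes the data
    grouped = defaultdict(list)
    for key, value in keyed:
        grouped[key].append(value)

    # Step 3: expand the grouped iterables back into a flat stream of elements
    return [v for values in grouped.values() for v in values]

data = ["a", "b", "c", "d"]
out = reshuffle(data)
# The multiset of elements is preserved; only their grouping/order changes.
assert sorted(out) == sorted(data)
```

The net effect is an identity transform on the element set, whose only purpose is the materialization in the middle.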

In the event of a failure, Cloud Dataflow will retry the failed bundle (indefinitely in streaming mode, and up to 4 times in batch mode), as per Deploying a Pipeline#error-and-exception-handling. In the Apache Beam model, the retry only occurs on the failing transform, unless it is coupled (Coupled Failing).
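My mental model of the batch retry semantics, as a hypothetical stdlib-only sketch (the names `process_bundle` and `BundleFailure` are mine, not a Dataflow API): the entire bundle is retried when any element fails, and outputs become visible only once a whole attempt succeeds.

```python
class BundleFailure(Exception):
    pass

def process_bundle(bundle, dofn, committed, max_attempts=4):
    """Sketch of batch retry semantics: the *whole* bundle is reprocessed on
    any failure, and outputs are committed only after a fully successful
    attempt -- partial results from failed attempts are discarded."""
    for attempt in range(1, max_attempts + 1):
        staged = []  # outputs of this attempt, not yet visible downstream
        try:
            for element in bundle:
                staged.append(dofn(element))
        except BundleFailure:
            continue  # discard staged outputs, retry the whole bundle
        committed.extend(staged)  # commit only on full success
        return attempt
    raise BundleFailure(f"bundle failed after {max_attempts} attempts")

# A DoFn that fails the first two times it sees element 3, then succeeds.
failures = {"remaining": 2}
def flaky_dofn(x):
    if x == 3 and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise BundleFailure("transient write error")
    return x * 10

committed = []
attempts = process_bundle([1, 2, 3], flaky_dofn, committed)
# Elements 1 and 2 were reprocessed on every attempt, but their outputs
# appear exactly once, because only the successful attempt was committed.
assert attempts == 3 and committed == [10, 20, 30]
```

This is the "only the outputs of a single successful attempt are observed downstream" behaviour referenced later in the question.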

Based on this understanding, I have executed the following pipelines (simplified for conciseness):

  • Pipeline A: Read Files >> Parse Files (High Fan-out ParDo transform) >> [Write to BigQuery, Write to Pub/Sub]
  • Pipeline B: Read Files >> Parse Files (High Fan-out ParDo transform) >> [Reshuffle >> Write to BigQuery, Write to Pub/Sub]

Both pipelines write to a non-existent Pub/Sub topic to replicate an error scenario.

For pipeline A, we observed the following:

  • our ParDo (which parses the files) is retrying, since the bundle could not be written to Pub/Sub.
  • system latency is increasing and there is no output to BigQuery. The BigQuery write internally contains a Reshuffle, so every step prior to that Reshuffle's GroupByKey operation is being retried as part of the fused stage, as we expected (the bundle has an element with an error).

For pipeline B, we observed the following:

  • our ParDo does not appear to be retrying
  • interestingly, we are seeing data written to BigQuery, even though errors are reported in the Dataflow Pub/Sub metrics (we therefore expected a retry to reprocess the file rather than any write to BigQuery)

Based on my previous question (How is exactly-once processing maintained during worker failures or bundle retries?), I understand that only the outputs/side effects of a single successful attempt will be observed downstream. Pipeline A appears to conform to this, while Pipeline B does not. The only difference between these pipelines is the addition of the Reshuffle PTransform, yet they exhibit different behaviour.

In summary, I guess I have the following question: how are retries handled in the context of a fused stage, in particular a fused stage with branching outputs? E.g. if there is a bundle retry, does it retry all branches or only the failing branch?
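To state my own hypothesis precisely, here is a stdlib-only sketch (my mental model, not Dataflow internals) of what "retry the whole fused stage" would mean for branching outputs: every branch runs inside the same bundle execution, so a failure in any branch discards all branches' outputs and re-runs them all on retry.

```python
def run_fused_stage(bundle, branch_fns, max_attempts=4):
    """Hypothetical model of a fused stage with branching outputs: each
    element flows into every branch; a failure in ANY branch fails the
    bundle, so ALL branches are re-executed on retry and nothing is
    committed until a whole attempt succeeds."""
    for attempt in range(1, max_attempts + 1):
        staged = {name: [] for name in branch_fns}
        try:
            for element in bundle:
                for name, fn in branch_fns.items():
                    staged[name].append(fn(element))
        except RuntimeError:
            continue  # one branch failed -> discard every branch's output
        return attempt, staged
    raise RuntimeError("stage exhausted retries; no branch output is committed")

calls = {"bq": 0}
def write_bq(x):        # succeeding branch -- still re-executed on every retry
    calls["bq"] += 1
    return ("bq", x)

def write_pubsub(x):    # failing branch, e.g. a non-existent topic
    raise RuntimeError("topic not found")

try:
    run_fused_stage([1, 2], {"bq": write_bq, "pubsub": write_pubsub})
except RuntimeError:
    pass
# The BigQuery branch ran once per attempt (4 attempts), yet nothing was
# committed -- matching what I saw in Pipeline A, but not in Pipeline B.
assert calls["bq"] == 4
```

Under this model, inserting a Reshuffle would split the graph into two stages at the GroupByKey, so only the stage downstream of it retries. I would like to confirm whether this is actually how Dataflow behaves.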

Seng Cheong
