In GCP we can see the pipeline execution graph. Is the same possible when running locally via DirectRunner?

Dimon Buzz

2 Answers

You can use the pipeline_graph module together with the InteractiveRunner to get a Graphviz (DOT) representation of your pipeline locally.

An example for the word count pipeline used in the Beam documentation:

import apache_beam as beam
from apache_beam.runners.interactive.display import pipeline_graph
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import re

pipeline = beam.Pipeline(InteractiveRunner())
lines = pipeline | beam.Create([f"some_file_{i}.txt" for i in range(10)])

# Count the occurrences of each word.
counts = (
    lines
    | 'Split' >> (
        beam.FlatMap(
            lambda x: re.findall(r'[A-Za-z\']+', x)).with_output_types(str))
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'GroupAndSum' >> beam.CombinePerKey(sum))

# Format the counts into a PCollection of strings.
def format_result(word_count):
    (word, count) = word_count
    return f'{word}: {count}'

output = counts | 'Format' >> beam.Map(format_result)

# Write the output using a "Write" transform that has side effects.
# pylint: disable=expression-not-assigned
output | beam.io.WriteToText("some_file.txt")

print(pipeline_graph.PipelineGraph(pipeline).get_dot())

This prints:

digraph G {
node [color=blue, fontcolor=blue, shape=box];
"Create";
lines [shape=circle];
"Split";
pcoll4978 [label="", shape=circle];
"PairWithOne";
pcoll8859 [label="", shape=circle];
"GroupAndSum";
counts [shape=circle];
"Format";
output [shape=circle];
"WriteToText";
pcoll6409 [label="", shape=circle];
"Create" -> lines;
lines -> "Split";
"Split" -> pcoll4978;
pcoll4978 -> "PairWithOne";
"PairWithOne" -> pcoll8859;
pcoll8859 -> "GroupAndSum";
"GroupAndSum" -> counts;
counts -> "Format";
"Format" -> output;
output -> "WriteToText";
"WriteToText" -> pcoll6409;
}

Putting this into https://edotor.net results in:

[Image: beam pipeline]

If you need prettier output, you can render the DOT text with Graphviz from Python (using the graphviz package, for example).
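One possible way to do this without an extra Python dependency, as a minimal sketch: write the DOT text to a file and call the Graphviz `dot` command line tool if it is installed. The helper name `render_dot` and the output file names are hypothetical, and the short `sample` string stands in for the output of `get_dot()` above.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_dot(dot_source, out_dir, name="pipeline", fmt="svg"):
    """Write DOT text to <name>.dot and render it if Graphviz is available.

    Returns the path of the rendered file, or the path of the .dot file
    when the `dot` executable is not on PATH.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Always keep the raw DOT text so it can be pasted into an online viewer.
    dot_path = out_dir / f"{name}.dot"
    dot_path.write_text(dot_source, encoding="utf-8")

    dot_exe = shutil.which("dot")
    if dot_exe is None:
        # Graphviz is not installed; fall back to the .dot file only.
        return dot_path

    rendered = out_dir / f"{name}.{fmt}"
    subprocess.run(
        [dot_exe, f"-T{fmt}", str(dot_path), "-o", str(rendered)],
        check=True,
    )
    return rendered


# Stand-in for pipeline_graph.PipelineGraph(pipeline).get_dot()
sample = 'digraph G {\n"Create" -> lines;\nlines -> "Split";\n}\n'
result = render_dot(sample, tempfile.mkdtemp())
print(result)
```

With a real pipeline you would pass the string returned by `get_dot()` instead of `sample`. Falling back to the .dot file keeps the script usable on machines without Graphviz.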

Daniel T

You can also use the Python SDK's RenderRunner, for example:

python -m apache_beam.examples.wordcount --output out.txt \
    --runner=apache_beam.runners.render.RenderRunner \
    --render_output=pipeline.svg

This also has an interactive mode, triggered by passing --port=N (where 0 picks an unused port), which serves the graph as a local web service. There you can expand and collapse composite transforms for easier exploration. Any --render_output arguments you pass are re-rendered as you edit the graph. It uses Graphviz under the hood, so it can render any output format Graphviz supports.

[Image: Rendered Graph]

To render non-Python pipelines, you can start it as a local portable "runner":

python -m apache_beam.runners.render

Then "submit" the job from your other SDK, using a portable runner pointed at the job API endpoint it provides, to view the graph.

robertwb