I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.

Can anyone explain what `_`, `|`, and `>>` are doing in the code below? Also, is the text in quotes (e.g. 'ReadTrainingData') meaningful, or could it be exchanged for any other label? In other words, how is that label being used?

train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

_ = (input_metadata
     | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
         os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
         pipeline=pipeline))

preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
(train_dataset, train_metadata), transform_fn = (
  (train_data, input_metadata)
  | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
      preprocessing_fn))
Augustin
dobbysock1002
  • The idea is to take `.method()` syntax and turn it into an infix operator, for example taking something like `a.plus(b)` and making it possible to write `a + b`. Check out the `__or__`, `__ror__`, and `__rrshift__` function definitions in the source (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L448). So `MyPTransform | NextPTransform` really takes both of those PTransform objects and passes them as a list `_parts` to a new `_ChainedPTransform` object, and if you `__or__` yet another PTransform, it flattens the nested lists. – Davos Oct 25 '18 at 13:16
  • `>>`, aka `__rrshift__`, effectively sets the `label` attribute of the PTransform, but it doesn't just do something like `PTransform.label('new name')` or `PTransform.label = 'new name'`; it seems more convoluted to me. `'ReadEvalData' >> _ReadData(eval_data)` evaluates to a new `_NamedPTransform` object with `_ReadData(eval_data)` stored as the `self.transform` attribute, and the string `'ReadEvalData'` set as the label by using `super` to run the parent class `PTransform`'s `__init__` method. – Davos Oct 25 '18 at 13:28
  • `__ror__` is the same as `__or__`; they both end up as the pipe infix `|`, but `__ror__` lets you define how to pipe other objects into a `PTransform` when those objects don't define the pipe operator themselves. It makes the pipe operator of the thing on the right still work from left to right. – Davos Oct 25 '18 at 13:30
  • Extremely convoluted, creating a new object with the current object as an attribute, just to get a functional / shell-style flow going. It does make the processing pipeline easier to read from left to right overall, though, and it looks similar to functional programming syntax like F# or magrittr in R. It is probably better than a lot of nested function calls like `PTransform.apply(NextPTransform.apply(YetAnotherPTransform))`, but then you can always create new variables for each step. After all, it is lazily evaluated and it is not doing deep copies of data, so there's no penalty. – Davos Oct 25 '18 at 13:33
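The `__ror__` fallback described in these comments can be sketched in plain Python. This is a toy illustration, not Beam's code: the `Transform` class and its string result are hypothetical stand-ins. It shows why a plain tuple such as `(train_data, input_metadata)` can sit on the left of `|`: the tuple does not know how to handle `|`, so Python hands the operation to the right-hand object's `__ror__`.

```python
class Transform:
    """Hypothetical stand-in for a PTransform; not Beam's implementation."""

    def __init__(self, label):
        self.label = label

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` does not
        # implement `|` for this operand (a tuple defines no __or__ at all).
        return f"{self.label} applied to {left!r}"


# The left operand is an ordinary tuple, as in the question's
# `(train_data, input_metadata) | 'AnalyzeAndTransform' >> ...` line.
result = ('train_data', 'metadata') | Transform('AnalyzeAndTransform')
print(result)
```

Without `__ror__` on the right-hand class, the same expression would raise a `TypeError`, because tuples define no `|` operator of their own.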

1 Answer

Operators in Python can be overloaded. In Beam, `|` is a synonym for `apply`, which applies a PTransform to a PCollection to produce a new PCollection. `>>` lets you name a step for easier display in various UIs. The string between the `|` and the `>>` is used only for display and for identifying that particular application of the transform.

See https://beam.apache.org/documentation/programming-guide/#transforms
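To make the overloading concrete, here is a minimal pure-Python sketch. It assumes nothing about Beam's internals: `Pipeline`, `Transform`, and `ReadData` are hypothetical toy classes, and unlike Beam (which, per the comments on the question, wraps the transform in a new object) this toy simply stores the label on the transform itself. Note that Python gives `>>` higher precedence than `|`, so `pipeline | 'Label' >> transform` is parsed as `pipeline | ('Label' >> transform)`.

```python
class Pipeline:
    """Toy stand-in for a pipeline: records the label of each applied step."""

    def __init__(self):
        self.applied = []

    def apply(self, transform):
        self.applied.append(transform.label)
        return transform.expand(self)

    def __or__(self, transform):
        # `pipeline | transform` is sugar for `pipeline.apply(transform)`.
        return self.apply(transform)


class Transform:
    """Toy stand-in for a PTransform (simplified, not Beam's code)."""

    def __init__(self):
        self.label = type(self).__name__  # default label: the class name

    def __rrshift__(self, label):
        # Called for `'SomeLabel' >> transform`, because str defines no
        # `>>` operator of its own. Attaches the label, returns the transform.
        self.label = label
        return self

    def expand(self, pipeline):
        raise NotImplementedError


class ReadData(Transform):
    def __init__(self, path):
        super().__init__()
        self.path = path

    def expand(self, pipeline):
        return f"records from {self.path}"


pipeline = Pipeline()

# Operator form, as in the question.
train_data = pipeline | 'ReadTrainingData' >> ReadData('train.csv')

# The same thing spelled out with explicit method calls.
eval_data = pipeline.apply(ReadData('eval.csv').__rrshift__('ReadEvalData'))

print(train_data, eval_data, pipeline.applied)
```

In this sketch the label never affects the data that comes back; it only ends up in the `applied` list, which plays the role of the step names a UI would display.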

Ben Chambers
rf-
  • Thank you! This is very helpful. If I'm understanding correctly, are they using the `_` because there is no output PCollection for a WriteMetadata transform? – dobbysock1002 May 06 '17 at 01:43
  • Notionally, it should return a PDone [1], which they don't need (so they use the throwaway _). [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pvalue.py#L163 – rf- May 16 '17 at 22:11
  • @JoelCroteau: I guess they thought using `|` symbols - which are often pipe symbols (in shell for example) - for this would be so clever since you're building a pipeline ... – Gerard May 24 '19 at 08:36