GCP Dataflow 2.0 PubSub to GCS

Question

I'm having a difficult time understanding the concepts of .withFileNamePolicy of TextIO.write(). The requirements for supplying a FileNamePolicy seem incredibly complex for doing something as simple as specifying a GCS bucket to write streamed filed.

At a high level, I have JSON messages being streamed to a PubSub topic, and I'd like to write those raw messages to files in GCS for permanent storage (I'll also be doing other processing on the messages). I initially started with this Pipeline, thinking it would be pretty simple:

public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options); 

        p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
            .apply("Write to GCS", TextIO.write().to(gcs_bucket);

        p.run();

    }

I got the error about needing WindowedWrites, which I applied, and then needing a FileNamePolicy. This is where things get hairy.

I went to the Beam docs and checked out FilenamePolicy. It looks like I would need to extend this class which then also require extending other abstract classes to make this work. Unfortunately the documentation on Apache is a bit scant and I can't find any examples for Dataflow 2.0 doing this, except for The Wordcount Example, which even then uses implements these details in a helper class.

So I could probably make this work just by copying much of the WordCount example, but I'm trying to better understand the details of this. A few questions I have:

1) Is there any roadmap item to abstract a lot of this complexity? It seems like I should be able to do supply a GCS bucket like I would in a nonWindowedWrite, and then just supply a few basic options like the timing and file naming rule. I know writing streaming windowed data to files is more complex than just opening a file pointer (or object storage equivalent).

2) It looks like to make this work, I need to create a WindowedContext object which requires supplying a BoundedWindow abstract class, and PaneInfo Object Class, and then some shard info. The information available for these is pretty bare and I'm having a hard time knowing what is actually needed for all of these, especially given my simple use case. Are there any good examples available that implement these? In addition, it also looks like I need the set the # of shards as part of TextIO.write, but then also supply # shards as part of the fileNamePolicy?

Thanks for anything in helping me understand the details behind this, hoping to learn a few things!

Edit 7/20/17 So I finally got this pipeline to run with extending the FilenamePolicy. My challenge was needing to define the window of the streaming data from PubSub. Here is a pretty close representation of the code:

public class ReadData {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options);

        p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("Write to GCS", TextIO.write().to("gcs_bucket")
                .withWindowedWrites()
                .withFilenamePolicy(new TestPolicy())
                .withNumShards(10));

        p.run();

    }
}

class TestPolicy extends FileBasedSink.FilenamePolicy {
    @Override
    public ResourceId windowedFilename(
        ResourceId outputDirectory, WindowedContext context, String extension) {
        IntervalWindow window = (IntervalWindow) context.getWindow();
        String filename = String.format(
            "%s-%s-%s-%s-of-%s.json",
            "test",
            window.start().toString(),
            window.end().toString(),
            context.getShardNumber(),
            context.getShardNumber()
        );
        return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(
        ResourceId outputDirectory, Context context, String extension) {
        throw new UnsupportedOperationException("Unsupported.");
    }
}

score 8 · Answer 1 · edited Aug 29 '17 at 10:08

In Beam 2.0, the below is an example of writing the raw messages from PubSub out into windowed files on GCS. The pipeline is fairly configurable, allowing you to specify the window duration via a parameter and a sub directory policy if you want logical subsections of your data for ease of reprocessing / archiving. Note that this has an additional dependency on Apache Commons Lang 3.

PubSubToGcs

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and
 * outputs the raw data into windowed files at the specified output
 * directory.
 */
public class PubsubToGcs {

  /**
   * Options supported by the pipeline.
   * 
   * <p>Inherits standard configuration options.</p>
   */
  public static interface Options extends DataflowPipelineOptions, StreamingOptions {
    @Description("The Cloud Pub/Sub topic to read from.")
    @Required
    ValueProvider<String> getTopic();
    void setTopic(ValueProvider<String> value);

    @Description("The directory to output files to. Must end with a slash.")
    @Required
    ValueProvider<String> getOutputDirectory();
    void setOutputDirectory(ValueProvider<String> value);

    @Description("The filename prefix of the files to write to.")
    @Default.String("output")
    @Required
    ValueProvider<String> getOutputFilenamePrefix();
    void setOutputFilenamePrefix(ValueProvider<String> value);

    @Description("The shard template of the output file. Specified as repeating sequences "
        + "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the "
        + "shard number, or number of shards respectively")
    @Default.String("")
    ValueProvider<String> getShardTemplate();
    void setShardTemplate(ValueProvider<String> value);

    @Description("The suffix of the files to write.")
    @Default.String("")
    ValueProvider<String> getOutputFilenameSuffix();
    void setOutputFilenameSuffix(ValueProvider<String> value);

    @Description("The sub-directory policy which files will use when output per window.")
    @Default.Enum("NONE")
    SubDirectoryPolicy getSubDirectoryPolicy();
    void setSubDirectoryPolicy(SubDirectoryPolicy value);

    @Description("The window duration in which data will be written. Defaults to 5m. "
        + "Allowed formats are: "
        + "Ns (for seconds, example: 5s), "
        + "Nm (for minutes, example: 12m), "
        + "Nh (for hours, example: 2h).")
    @Default.String("5m")
    String getWindowDuration();
    void setWindowDuration(String value);

    @Description("The maximum number of output shards produced when writing.")
    @Default.Integer(10)
    Integer getNumShards();
    void setNumShards(Integer value);
  }

  /**
   * Main entry point for executing the pipeline.
   * @param args  The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   * 
   * @param options The execution parameters to the pipeline.
   * @return  The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /**
     * Steps:
     *   1) Read string messages from PubSub
     *   2) Window the messages into minute intervals specified by the executor.
     *   3) Output the windowed files to GCS
     */
    pipeline
      .apply("Read PubSub Events",
        PubsubIO
          .readStrings()
          .fromTopic(options.getTopic()))
      .apply(options.getWindowDuration() + " Window", 
          Window
            .into(FixedWindows.of(parseDuration(options.getWindowDuration()))))
      .apply("Write File(s)",
          TextIO
            .write()
            .withWindowedWrites()
            .withNumShards(options.getNumShards())
            .to(options.getOutputDirectory())
            .withFilenamePolicy(
                new WindowedFilenamePolicy(
                    options.getOutputFilenamePrefix(),
                    options.getShardTemplate(),
                    options.getOutputFilenameSuffix())
                .withSubDirectoryPolicy(options.getSubDirectoryPolicy())));

    // Execute the pipeline and return the result.
    PipelineResult result = pipeline.run();

    return result;
  }

  /**
   * Parses a duration from a period formatted string. Values
   * are accepted in the following formats:
   * <p>
   * Ns - Seconds. Example: 5s<br>
   * Nm - Minutes. Example: 13m<br>
   * Nh - Hours. Example: 2h
   * 
   * <pre>
   * parseDuration(null) = NullPointerException()
   * parseDuration("")   = Duration.standardSeconds(0)
   * parseDuration("2s") = Duration.standardSeconds(2)
   * parseDuration("5m") = Duration.standardMinutes(5)
   * parseDuration("3h") = Duration.standardHours(3)
   * </pre>
   * 
   * @param value The period value to parse.
   * @return  The {@link Duration} parsed from the supplied period string.
   */
  private static Duration parseDuration(String value) {
    Preconditions.checkNotNull(value, "The specified duration must be a non-null value!");

    PeriodParser parser = new PeriodFormatterBuilder()
      .appendSeconds().appendSuffix("s")
      .appendMinutes().appendSuffix("m")
      .appendHours().appendSuffix("h")
      .toParser();

    MutablePeriod period = new MutablePeriod();
    parser.parseInto(period, value, 0, Locale.getDefault());

    Duration duration = period.toDurationFrom(new DateTime(0));
    return duration;
  }
}

WindowedFilenamePolicy

/**
 * The {@link WindowedFilenamePolicy} class will output files
 * to the specified location with a format of output-yyyyMMdd'T'HHmmssZ-001-of-100.txt.
 */
@SuppressWarnings("serial")
public class WindowedFilenamePolicy extends FilenamePolicy {

    /**
     * Possible sub-directory creation modes.
     */
    public static enum SubDirectoryPolicy {
        NONE("."),
        PER_HOUR("yyyy-MM-dd/HH"),
        PER_DAY("yyyy-MM-dd");

        private final String subDirectoryPattern;

        private SubDirectoryPolicy(String subDirectoryPattern) {
            this.subDirectoryPattern = subDirectoryPattern;
        }

        public String getSubDirectoryPattern() {
            return subDirectoryPattern;
        }

        public String format(Instant instant) {
            DateTimeFormatter formatter = DateTimeFormat.forPattern(subDirectoryPattern);
            return formatter.print(instant);
        }
    }

    /**
     * The formatter used to format the window timestamp for outputting to the filename.
     */
    private static final DateTimeFormatter formatter = ISODateTimeFormat
            .basicDateTimeNoMillis()
            .withZone(DateTimeZone.getDefault());

    /**
     * The filename prefix.
     */
    private final ValueProvider<String> prefix;

    /**
     * The filenmae suffix.
     */
    private final ValueProvider<String> suffix;

    /**
     * The shard template used during file formatting.
     */
    private final ValueProvider<String> shardTemplate;

    /**
     * The policy which dictates when or if sub-directories are created
     * for the windowed file output.
     */
    private ValueProvider<SubDirectoryPolicy> subDirectoryPolicy = StaticValueProvider.of(SubDirectoryPolicy.NONE);

    /**
     * Constructs a new {@link WindowedFilenamePolicy} with the
     * supplied prefix used for output files.
     * 
     * @param prefix    The prefix to append to all files output by the policy.
     * @param shardTemplate The template used to create uniquely named sharded files.
     * @param suffix    The suffix to append to all files output by the policy.
     */
    public WindowedFilenamePolicy(String prefix, String shardTemplate, String suffix) {
        this(StaticValueProvider.of(prefix), 
                StaticValueProvider.of(shardTemplate),
                StaticValueProvider.of(suffix));
    }

    /**
     * Constructs a new {@link WindowedFilenamePolicy} with the
     * supplied prefix used for output files.
     * 
     * @param prefix    The prefix to append to all files output by the policy.
     * @param shardTemplate The template used to create uniquely named sharded files.
     * @param suffix    The suffix to append to all files output by the policy.
     */
    public WindowedFilenamePolicy(
            ValueProvider<String> prefix, 
            ValueProvider<String> shardTemplate, 
            ValueProvider<String> suffix) {
        this.prefix = prefix;
        this.shardTemplate = shardTemplate;
        this.suffix = suffix; 
    }

    /**
     * The subdirectory policy will create sub-directories on the
     * filesystem based on the window which has fired.
     * 
     * @param policy    The subdirectory policy to apply.
     * @return The filename policy instance.
     */
    public WindowedFilenamePolicy withSubDirectoryPolicy(SubDirectoryPolicy policy) {
        return withSubDirectoryPolicy(StaticValueProvider.of(policy));
    }

    /**
     * The subdirectory policy will create sub-directories on the
     * filesystem based on the window which has fired.
     * 
     * @param policy    The subdirectory policy to apply.
     * @return The filename policy instance.
     */
    public WindowedFilenamePolicy withSubDirectoryPolicy(ValueProvider<SubDirectoryPolicy> policy) {
        this.subDirectoryPolicy = policy;
        return this;
    }

    /**
     * The windowed filename method will construct filenames per window in the
     * format of output-yyyyMMdd'T'HHmmss-001-of-100.txt.
     */
    @Override
    public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext c, String extension) {
        Instant windowInstant = c.getWindow().maxTimestamp();
        String datetimeStr = formatter.print(windowInstant.toDateTime());

        // Remove the prefix when it is null so we don't append the literal 'null'
        // to the start of the filename
        String filenamePrefix = prefix.get() == null ? datetimeStr : prefix.get() + "-" + datetimeStr;
        String filename = DefaultFilenamePolicy.constructName(
                filenamePrefix, 
                shardTemplate.get(), 
                StringUtils.defaultIfBlank(suffix.get(), extension),  // Ignore the extension in favor of the suffix.
                c.getShardNumber(), 
                c.getNumShards());

        String subDirectory = subDirectoryPolicy.get().format(windowInstant);
        return outputDirectory
                .resolve(subDirectory, StandardResolveOptions.RESOLVE_DIRECTORY)
                .resolve(filename, StandardResolveOptions.RESOLVE_FILE);
    }

    /**
     * Unwindowed writes are unsupported by this filename policy so an {@link UnsupportedOperationException}
     * will be thrown if invoked.
     */
    @Override
    public ResourceId unwindowedFilename(ResourceId outputDirectory, Context c, String extension) {
    throw new UnsupportedOperationException("There is no windowed filename policy for unwindowed file"
        + " output. Please use the WindowedFilenamePolicy with windowed writes or switch filename policies.");
    }
}

The code has changed a bit in Beam 2.2 after a [bug](https://issues.apache.org/jira/browse/BEAM-3169) filed concerning temp files. A good example for the [code](https://github.com/apache/beam/blob/761ec1af410f0ee153893f6e7082db85d0fdc3e7/examples/java/src/main/java/org/apache/beam/examples/common/WriteOneFilePerWindow.java) to use now. — Matthias Baetens, Jan 23 '18 at 15:08

score 1 · Answer 2 · answered Jul 18 '17 at 17:53

1

In Beam currently the DefaultFilenamePolicy supports windowed writes, so there's no need to write a custom FilenamePolicy. You can control the output filename by putting W and P placeholders (for the window and pane respectively) in the filename template. This exists in the head beam repository, and will also be in the upcoming Beam 2.1 release (which is being released as we speak).

answered Jul 18 '17 at 17:53

Reuven Lax

771
5
9

Hey Reuven, thanks for the info. I think I tried this previously and didn't have luck, when I looked at the docs it said its for un-windowed only. I'm also using Google Cloud Dataflow, and have been using the 2.0 release. Sounds like this is upcoming? https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/DefaultFilenamePolicy.html – Chris Schrader Jul 18 '17 at 19:30
This is going out in Beam 2.1. Once that is released, Google Cloud Dataflow 2.1 will be released containing this functionality. You can try it out today by building directly off of the latest Beam snapshot. – Reuven Lax Jul 18 '17 at 20:30
Another note: writing a custom FilenamePolicy is not terribly hard. You simply need to override the windowedFilename method, and map the input parameters into a filename. However note that this API will be changing slightly in upcoming releases. – Reuven Lax Jul 18 '17 at 20:31
Thanks again Reuvan. I tried writing a custom FilenamePolicy but I'm having trouble understanding how to override the windowedFilename method. It looks like it needs to accept a ResourceID and a Context (In addition to the string), and the Javadocs are also quite scarce on these. The challenge I'm running into is all of these things are abstract classes that in return require other abstract classes that are all Beam specific. It's not clear to me where I actually provide the GCP bucket and filename format across all of these. – Chris Schrader Jul 18 '17 at 21:08
Sorry should be more specific. I got this error when attempting to create a custom FilenamePolicy: "Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey." I'm not using a GroupByKey, I'm actively trying to avoid doing so. It seems like that is the DataFlow 1.x approach. And apologies name was autocorrected, Reuven*. – Chris Schrader Jul 18 '17 at 21:41
The message says that you need to apply some windowing to your PCollection read from pubsub, e.g. Window.into(FixedWindows.of(Duration.standardMinutes(5))) or something like that. withWindowedWrites() needs the collection to be windowed, because it writes (roughly speaking) the contents of a window to a file when the window is complete. If the collection is unwindowed (i.e. globally windowed), then the window is never complete (new data can always arrive into the global window), and it would never be able to write any files. – jkff Jul 20 '17 at 06:04
Jeez, thanks jkff! I actually just figured that out before I saw your comment. I got the pipeline to actually run now! In simplest from what I can tell, reading from PubSub is a streaming operation and writing to GCS is not (I can stream to something like BigQuery or BigTable easily), so you need to define the windows in which it will trigger the writes. Is that about right? – Chris Schrader Jul 20 '17 at 18:17
2

Correct. Writing to GCS needs to write entire files, so needs some way of grouping. Note that if you don't need the semantic correctness of windows, you can also simply use processing-time or count-based triggers for this. For example, if you just want each of your 10 shards to write a file every 10,000 elements you could instead do Window.into(new GlobalWindows()).triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(10000))); – Reuven Lax Jul 21 '17 at 16:50
Another note: Dataflow 2.1 (in the process of being released) provides default naming policies for windowed writes, so you might not need to write a custom filename policy. It also removes the restriction that you must specify numShards - if you leave it unset, the runner will pick a good shard count for you. – Reuven Lax Jul 21 '17 at 16:55

GCP Dataflow 2.0 PubSub to GCS

2 Answers2

Linked