I'm using Apache Beam 2.28.0 on Google Cloud Dataflow (with the Scio SDK). I have a large bounded input PCollection and I want to limit/sample it to a fixed number of elements, but I also want downstream processing to start as soon as possible.

Currently, when my input PCollection has e.g. 20M elements and I limit it to 1M using Sample.any (https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/Sample.html#any-long-):
input.apply(Sample.<String>any(1000000))
it waits until all 20M elements are read, which takes a long time.
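For reference, here's a minimal sketch of the pipeline shape (the input path and the final write are placeholders; any bounded source shows the same behavior):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

public class SampleAnyRepro {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Hypothetical bounded input with ~20M lines (path is a placeholder).
    PCollection<String> input =
        pipeline.apply(TextIO.read().from("gs://my-bucket/input-*.txt"));

    // As far as I can tell, Sample.any is a global Combine, so it emits its
    // output only after the whole input has been consumed; downstream stages
    // sit idle until all 20M elements are read.
    PCollection<String> limited = input.apply(Sample.<String>any(1_000_000));

    // Placeholder downstream step.
    limited.apply(TextIO.write().to("gs://my-bucket/limited"));

    pipeline.run();
  }
}
```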
How can I efficiently limit the input to a fixed number of elements and start downstream processing as soon as the limit is reached, discarding the rest of the input?
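For what it's worth, the only workaround I can think of is a naive stateful DoFn that counts elements and drops everything past the limit (sketch below; TakeFirstN and the constant-key trick are my own placeholder names, not anything from Beam). As far as I understand, this caps the output but requires funneling everything through a single key, which serializes processing, and it still reads the entire input; the source isn't stopped early, which is exactly the part I'm trying to avoid.

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Naive "take first N" via a stateful DoFn (placeholder name). State is per
// key, so the input must be keyed; a single constant key gives one global
// counter but no parallelism, and the source still reads all 20M elements.
class TakeFirstN extends DoFn<KV<String, String>, String> {
  private final long limit;

  TakeFirstN(long limit) {
    this.limit = limit;
  }

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long seen = count.read();
    long n = (seen == null) ? 0L : seen;
    if (n < limit) {
      count.write(n + 1);
      c.output(c.element().getValue());
    }
    // Past the limit the element is simply dropped, but upstream keeps reading.
  }
}

// Usage sketch (constant key => one counter, no parallelism):
// input.apply(WithKeys.of("all")).apply(ParDo.of(new TakeFirstN(1_000_000)))
```

Is there a transform (or runner feature on Dataflow) that actually cancels or short-circuits the remaining read once the limit is hit, rather than reading everything and discarding the excess?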