How to do a cartesian product of two PCollections in Dataflow?

Question

I would like to compute the Cartesian product of two PCollections. Neither PCollection fits into memory, so using a side input is not feasible.

My goal is this: I have two datasets. One has many small elements. The other has few (~10) very large elements. I would like to take the product of these two datasets and then produce key-value objects.

score 4 · Accepted Answer · edited Jul 27 '21 at 18:31

I think CoGroupByKey might work in your situation:

https://cloud.google.com/dataflow/model/group-by-key#join

That's what I did for a similar use case, though mine was probably not constrained by memory (have you tried a larger cluster with bigger machines?):

PCollection<KV<String, TableRow>> inputClassifiedKeyed = inputClassified
        .apply(ParDo.named("Actuals : Keys").of(new ActualsRowToKeyedRow()));

PCollection<KV<String, Iterable<Map<String, String>>>> groupedCategories = p
[...]
.apply(GroupByKey.create());

So the collections are keyed by the same key.

Then I declared the Tags:

final TupleTag<Iterable<Map<String, String>>> categoryTag = new TupleTag<>();
final TupleTag<TableRow> actualsTag = new TupleTag<>();

Combined them:

PCollection<KV<String, CoGbkResult>> actualCategoriesCombined =
        KeyedPCollectionTuple.of(actualsTag, inputClassifiedKeyed)
                .and(categoryTag, groupedCategories)
                .apply(CoGroupByKey.create());

The final step, in my case, was reformatting the results from the tagged groups:

actualCategoriesCombined.apply(ParDo.named("Actuals : Formatting").of(
    new DoFn<KV<String, CoGbkResult>, TableRow>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            KV<String, CoGbkResult> e = c.element();

            Iterable<TableRow> actualTableRows =
                    e.getValue().getAll(actualsTag);
            Iterable<Iterable<Map<String, String>>> categoriesAll =
                    e.getValue().getAll(categoryTag);

            for (TableRow row : actualTableRows) {
                // Some of the actuals do not have categories
                if (categoriesAll.iterator().hasNext()) {
                    row.put("advertiser", categoriesAll.iterator().next());
                }
                c.output(row);
            }
        }
    }));
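
To make the shape of the co-grouped result easier to see, here is a minimal plain-Java sketch of what grouping two keyed collections by key produces. It does not use the Dataflow SDK; `CoGroupSketch`, `Grouped` and `coGroup` are hypothetical names made up for illustration, and the in-memory lists only stand in for PCollections.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoGroupSketch {
    /** Per-key result: all left values and all right values that share the key. */
    static final class Grouped<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();
    }

    /** Groups two keyed lists by key, roughly the way a co-group-by-key join does. */
    static <K, A, B> Map<K, Grouped<A, B>> coGroup(
            List<Map.Entry<K, A>> lefts, List<Map.Entry<K, B>> rights) {
        Map<K, Grouped<A, B>> result = new LinkedHashMap<>();
        for (Map.Entry<K, A> e : lefts) {
            result.computeIfAbsent(e.getKey(), k -> new Grouped<>()).left.add(e.getValue());
        }
        for (Map.Entry<K, B> e : rights) {
            result.computeIfAbsent(e.getKey(), k -> new Grouped<>()).right.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "actuals" on the left, "categories" on the right.
        List<Map.Entry<String, String>> actuals = List.of(
                Map.entry("k1", "row-a"), Map.entry("k1", "row-b"), Map.entry("k2", "row-c"));
        List<Map.Entry<String, String>> categories = List.of(Map.entry("k1", "cat-x"));

        Map<String, Grouped<String, String>> grouped = coGroup(actuals, categories);
        // k2 has an actual but no category, which is the case the hasNext() check guards.
        grouped.forEach((key, g) -> System.out.println(key + " " + g.left + " " + g.right));
    }
}
```

Each key ends up with two lists, one per tag, and either list may be empty. That is why the formatting step above checks `hasNext()` before reading a category.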

Hope this helps. Again, I'm not sure about the memory constraints. Please share the results if you try this approach.

Due to my lack of experience in both Java and dataflow, it is taking a long time for me to parse this. If possible, could you post a simple example in python also? I think it will be better to expand this question instead of me asking a new question. — KobeJohn, Feb 06 '17 at 03:17
For reference, I'm coming from Spark where this is very simple: `collection_a.cartesian(collection_b)` — KobeJohn, Feb 06 '17 at 03:28

Igor Zinin · Answer 2 · 2019-09-13T14:33:50.370

To create a Cartesian product, use the Join extension for Apache Beam:

import org.apache.beam.sdk.extensions.joinlibrary.Join;

...

// Use function Join.fullOuterJoin(final PCollection<KV<K, V1>> leftCollection, final PCollection<KV<K, V2>> rightCollection, final V1 leftNullValue, final V2 rightNullValue)
// and the same key for all rows to create cartesian product as it is shown below:

    public static void process(Pipeline pipeline, DataInputOptions options) {
        PCollection<KV<Integer, CpuItem>> cpuList = pipeline
                .apply("ReadCPUs", TextIO.read().from(options.getInputCpuFile()))
                .apply("Creating Cpu Objects", new CpuItem()).apply("Preprocess Cpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(CpuItem.class)))
                                .via((CpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer, GpuItem>> gpuList = pipeline
                .apply("ReadGPUs", TextIO.read().from(options.getInputGpuFile()))
                .apply("Creating Gpu Objects", new GpuItem()).apply("Preprocess Gpu",
                        MapElements
                                .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptor.of(GpuItem.class)))
                                .via((GpuItem e) -> KV.of(0, e)));

        PCollection<KV<Integer,KV<CpuItem,GpuItem>>>  cartesianProduct = Join.fullOuterJoin(cpuList, gpuList, new CpuItem(), new GpuItem());
        PCollection<String> finalResultCollection = cartesianProduct.apply("Format results", MapElements.into(TypeDescriptors.strings())
                .via((KV<Integer, KV<CpuItem,GpuItem>> e) -> e.getValue().toString()));
        finalResultCollection.apply("Output the results",
                TextIO.write().to("fps.batchproc\\parsed_cpus").withSuffix(".log"));
        pipeline.run();
    }

In the code above, this line

...
        .via((CpuItem e) -> KV.of(0, e)));
...

wraps every row of the input data in a KV whose key is 0. Because all rows share the same key, every row on one side matches every row on the other. This is equivalent to a SQL JOIN with no WHERE clause.
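
The constant-key idea can be checked without a pipeline. The following is a minimal plain-Java sketch, not Beam code: `ConstantKeyProduct`, `joinOnKey`, `keyByZero` and `cartesian` are hypothetical names introduced here, and in-memory lists stand in for the PCollections. It only illustrates why joining on one shared key yields every pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ConstantKeyProduct {
    /** Join on key: emits one pair for each left/right combination whose keys are equal. */
    static <K, A, B> List<Map.Entry<A, B>> joinOnKey(
            List<Map.Entry<K, A>> lefts, List<Map.Entry<K, B>> rights) {
        List<Map.Entry<A, B>> out = new ArrayList<>();
        for (Map.Entry<K, A> l : lefts) {
            for (Map.Entry<K, B> r : rights) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(Map.entry(l.getValue(), r.getValue()));
                }
            }
        }
        return out;
    }

    /** Tags every element with the same key, 0, as the KV.of(0, e) lambda does. */
    static <T> List<Map.Entry<Integer, T>> keyByZero(List<T> items) {
        List<Map.Entry<Integer, T>> keyed = new ArrayList<>();
        for (T item : items) {
            keyed.add(Map.entry(0, item));
        }
        return keyed;
    }

    /** With one shared key, the join condition is always true, so all pairs are emitted. */
    static <A, B> List<Map.Entry<A, B>> cartesian(List<A> a, List<B> b) {
        return joinOnKey(keyByZero(a), keyByZero(b));
    }

    public static void main(String[] args) {
        // Hypothetical sample data: 3 CPUs and 2 GPUs give 3 x 2 = 6 pairs.
        List<Map.Entry<String, String>> pairs =
                cartesian(List.of("cpu1", "cpu2", "cpu3"), List.of("gpu1", "gpu2"));
        System.out.println(pairs.size());
        pairs.forEach(p -> System.out.println(p.getKey() + " x " + p.getValue()));
    }
}
```

This sketch says nothing about memory. With a single shared key every row lands in the same group, so whether the approach scales to the dataset sizes in the question is something to test on real data.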
