9

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).

Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.

Thanks!

Youness Bennani
  • 335
  • 2
  • 8
  • Do you have any more details on what you're trying to do? How big are each of the PCollections? There are several ways of accomplishing this, and which one is better depends on the reason you want the cartesian product and the actual PCollections involved. – Ben Chambers Jan 26 '16 at 21:29
  • The two PCollections are identical. They both contain roughly 100,000 tuples of type `(String, String)`. I'm using a dictionary of English words and got their phonetic transcription in order to generate 2-word puns, such as: "fantasti-CAL-ifornia". – Youness Bennani Jan 27 '16 at 00:51
  • For a direct cartesian solution, [this](http://stackoverflow.com/a/41051283/377366) seems to be the best answer available for now. – KobeJohn Feb 06 '17 at 01:41

1 Answers1

5

In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String> into a side input, and then you would have a ParDo that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String> of all the values, and you'd output the pairs (or you could in this DoFn choose to output only the pairs that line up).

This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.

Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection of values keyed by the 3-letter prefixes. You can also create the similar PCollection keyed by 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.

Ben Chambers
  • 6,070
  • 11
  • 16