0

In spark, if we have to reshuffle the data, we can use repartition of a dataframe. What's the way to do the same in apache beam for a pcollection?

In pyspark,

new_df = df.repartition(4)
bigbounty
  • 16,526
  • 5
  • 37
  • 65

1 Answers1

0

From this doc:

You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation.

Though I'm not sure if Reshuffle is and still will be supported by other runners of Beam.

The java doc and further explanation of Reshuffle: Apache Beam/Dataflow Reshuffle

ningk
  • 1,298
  • 1
  • 7
  • 7
  • The question is how to do it in apache beam? what's the method name to do it? – bigbounty May 04 '21 at 18:37
  • It is called `Reshuffle`, some [examples](https://www.codota.com/code/java/methods/org.apache.beam.sdk.transforms.Reshuffle/viaRandomKey). You can also implement your own reshuffle logic, for example: https://stackoverflow.com/questions/40767189/how-to-reshuffle-a-pcollectiont – ningk May 04 '21 at 19:57
  • Reshuffle is supported on all existing runners. – robertwb May 07 '21 at 02:29
  • For Python, you can do `reshuffled_pcoll = original_pcoll | beam.Reshuffle()` and in Java you can do `reshuffled_pcoll = original_pcoll.apply(Reshuffle.viaRandomKey())`. – robertwb May 07 '21 at 02:32