
Is there any way to split Spark partitions without going through the network or a shuffle? For example:

# p stands for partition
machine 1:
p1: 1,2 p2: 3,4
machine 2:
p3: 5,6 p4: 7,8

What I want is:

machine 1: 
p1:1, p2:2, p3:3, p4:4
machine 2:
p5:5, p6:6, p7:7, p8:8

Is there any way to do this? (I think there should be no network transfer or shuffle involved here.)

PS:

This is the reverse of coalesce. If I call coalesce(2), I suppose the result would be:

machine 1:
p1: 1,2,3,4
machine 2:
p2: 5,6,7,8

where the data does not go through the network and no shuffle is triggered, whereas coalesce(1) would cause network transfer because all the data on machine 2 goes to machine 1?
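
For illustration, here is a minimal sketch (local mode, made-up numbers) of the coalesce behaviour I am describing:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-sketch")
  .master("local[2]")
  .getOrCreate()

// 8 elements spread over 4 partitions, roughly like p1..p4 above
val rdd = spark.sparkContext.parallelize(1 to 8, numSlices = 4)

// coalesce(2) is a narrow dependency: partitions are merged locally, no shuffle
println(rdd.coalesce(2).getNumPartitions)    // 2

// coalesce cannot *increase* the partition count unless shuffle = true is passed
println(rdd.coalesce(8).getNumPartitions)    // still 4

// repartition(8) is coalesce(8, shuffle = true): a full shuffle
println(rdd.repartition(8).getNumPartitions) // 8
```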

Litchy

1 Answer


The repartition API can help, provided the application code is written in a certain way:

  1. Read the dataset and repartition it on column a. This causes a full shuffle across the network, after which all rows with the same value of a are co-located in the same partition.

  2. Once Step 1 is done, if you now repartition the dataset on columns a and b, the new partitions are created with minimal shuffling (see the sketch below).
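
A minimal sketch of the two steps (the DataFrame, the column names `a` and `b`, and the data are hypothetical placeholders, assuming the DataFrame API):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("repartition-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data; replace with your own source
val df = Seq((1, "x"), (1, "y"), (2, "x"), (2, "y")).toDF("a", "b")

// Step 1: hash-repartition on column a -- a full shuffle that co-locates
// all rows sharing the same value of a
val byA = df.repartition(col("a"))

// Step 2: repartition again on columns a and b for finer-grained partitions
val byAB = byA.repartition(col("a"), col("b"))

// The number of output partitions is bounded by spark.sql.shuffle.partitions (default 200)
println(byAB.rdd.getNumPartitions)
```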

You can read more about Hash Partitioner here - How does HashPartitioner work?
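
For context, a rough, simplified sketch of the placement rule HashPartitioner uses (illustration only, not the actual Spark source):

```scala
// HashPartitioner places a key in partition nonNegativeMod(key.hashCode, numPartitions)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def partitionFor(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

println(partitionFor("a", 4)) // which of the 4 partitions the key "a" hashes to
println(partitionFor(42, 4))  // keys with equal hashes always land in the same partition
```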

Jayadeep Jayaraman
  • I think your Step 2 relies on partitions that already share the same keys, so you first need to do something like `groupByKey` in Step 1? Is there any method to just split the partitions without considering any keys and without causing a shuffle at all? – Litchy Nov 13 '19 at 07:35
  • It is not possible to create multiple new partitions without network shuffling. – Jayadeep Jayaraman Nov 13 '19 at 08:57
  • I think theoretically it makes sense to have this operation – Litchy Nov 14 '19 at 12:50
  • Yes, it does, and there are some options, like setting `spark.sql.shuffle.partitions` to a higher value. Also, `coalesce` is there to handle scenarios where many small files are read from a directory or a lot of filters are applied; more details here - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L127. If this answer was helpful, please do not forget to accept it. – Jayadeep Jayaraman Nov 14 '19 at 14:58