Suppose I have a dataframe of 10 GB in which one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"c1"): will all the records be shuffled to the same partition? If so, wouldn't it exceed the maximum size per partition? How would repartition work in this case?
2 Answers
The configuration `spark.sql.files.maxPartitionBytes` is effective only when reading files from file-based sources. So when you execute `repartition`, you reshuffle your existing dataframe and the number of output partitions is defined by `repartition` logic alone; in your case, since every record has the same value in c1, all of the data will land in a single output partition.

- Is there no upper limit for the size of a single partition? So, in my case, would the whole 10 GB worth of records be stored in a single partition? – Arjunlal M.A Sep 23 '21 at 10:52
- There is no limit on partition size. The only limit is the size of your executor. As long as your executor has enough resources, it will be good enough. – Gabio Sep 23 '21 at 10:54
- In case of HDFS, what if the spark partition size is greater than the HDFS block size? – Arjunlal M.A Sep 24 '21 at 05:17
- The answer remains the same. After the dataframe was created, the number of output partitions after `repartition` is defined only by `repartition` logic. – Gabio Sep 24 '21 at 06:46
The value of 128 MB comes from the Spark property `spark.sql.files.maxPartitionBytes`, which applies only when you create a dataframe by reading a file-based source; see https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options for details. Its purpose is to achieve maximum parallelism while reading. So if you create a dataframe by transforming another dataframe or by joining two dataframes, its partitions are not affected by this value. For example, you can read 10 GB of data and then call df.repartition(1), and this will work without any issues (assuming your executor has enough memory).
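A short sketch of that distinction (the Parquet path is a hypothetical placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("maxPartitionBytes-vs-repartition").getOrCreate()

// spark.sql.files.maxPartitionBytes (default 128 MB) only influences how the
// file scan below is split into read partitions.
val df = spark.read.parquet("/data/some_10gb_dataset")
println(df.rdd.getNumPartitions)   // roughly 10 GB / 128 MB, i.e. around 80 read partitions

// Once the dataframe exists, repartition alone decides the layout: this yields
// exactly one (potentially 10 GB) partition, bounded only by the memory and
// disk available to the executor that processes it.
val single = df.repartition(1)
println(single.rdd.getNumPartitions)   // 1
```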
