Suppose I have a dataframe of 10 GB in which one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"c1"): will all the records be shuffled to the same partition? If so, wouldn't it exceed the maximum size per partition? How would repartition work in this case?
2 Answers
The configuration `spark.sql.files.maxPartitionBytes` is effective only when reading files from file-based sources. So when you execute `repartition`, you reshuffle your existing dataframe and the number of output partitions is defined by `repartition` logic alone; in your case, since every record has the same value in c1, all of the data will land in a single output partition.

- Is there no upper limit for the size of a single partition? So, in my case, would the whole 10 GB worth of records be stored in a single partition? – Arjunlal M.A Sep 23 '21 at 10:52
- There is no limit on partition size. The only limit is the size of your executor. As long as your executor has enough resources, it will be good enough. – Gabio Sep 23 '21 at 10:54
- In case of HDFS, what if the spark partition size is greater than the HDFS block size? – Arjunlal M.A Sep 24 '21 at 05:17
- The answer remains the same. After the dataframe was created, the number of output partitions after `repartition` is defined only by `repartition` logic. – Gabio Sep 24 '21 at 06:46
The value of 128 MB comes from the Spark property `spark.sql.files.maxPartitionBytes`, which applies only when you create a dataframe by reading a file-based source; see https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options for details. Its purpose is to achieve maximum parallelism while reading. So if you create a dataframe by transforming another dataframe or by joining two dataframes, its partitions are not affected by this value. For example, you can read 10 GB of data and then call df.repartition(1), and this will work without any issues (assuming your executor has enough memory).
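A short sketch of that distinction (the Parquet path is a hypothetical placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("maxPartitionBytes-vs-repartition").getOrCreate()

// spark.sql.files.maxPartitionBytes (default 128 MB) only influences how the
// file scan below is split into read partitions.
val df = spark.read.parquet("/data/some_10gb_dataset")
println(df.rdd.getNumPartitions)   // roughly 10 GB / 128 MB, i.e. around 80 read partitions

// Once the dataframe exists, repartition alone decides the layout: this yields
// exactly one (potentially 10 GB) partition, bounded only by the memory and
// disk available to the executor that processes it.
val single = df.repartition(1)
println(single.rdd.getNumPartitions)   // 1
```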
