2

Suppose I have a dataframe of 10 GB in which one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). If I call repartition($"c1"), will all the records be shuffled to the same partition? If so, wouldn't that exceed the maximum size per partition? How does repartition work in this case?

Gabio
Arjunlal M.A

2 Answers

2

The configuration spark.sql.files.maxPartitionBytes takes effect only when reading files from file-based sources. When you call repartition, you reshuffle the existing DataFrame, and the number of output partitions is determined solely by the repartition logic; in your case, all records will end up in a single (non-empty) partition.
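A minimal sketch of why a constant key collapses everything into one partition (plain Python, not Spark internals): hash-based repartitioning assigns each row to `hash(key) % numPartitions`, so every row with the same "c1" value maps to the same partition id. The `assign_partition` helper and the CRC32 hash are illustrative stand-ins, not Spark's actual Murmur3-based hash partitioner.

```python
import zlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Illustrative stand-in for a hash partitioner (Spark actually uses Murmur3)."""
    return zlib.crc32(key.encode()) % num_partitions

# Ten rows that all share the same value in column "c1"
rows = [{"c1": "constant", "id": i} for i in range(10)]

# Even with 200 target partitions, every row hashes to the same one;
# the remaining 199 partitions simply stay empty.
partition_ids = {assign_partition(r["c1"], 200) for r in rows}
print(len(partition_ids))  # 1
```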

Gabio
  • Is there no upper limit for the size of a single partition? So, in my case, would the whole 10 GB worth of records be stored in a single partition? – Arjunlal M.A Sep 23 '21 at 10:52
  • There is no limit on partition size. The only limit is the size of your executor. As long as your executor has enough resources, it will be good enough. – Gabio Sep 23 '21 at 10:54
  • In case of HDFS, what if the spark partition size is greater than the HDFS block size? – Arjunlal M.A Sep 24 '21 at 05:17
  • The answer remains the same: once the dataframe has been created, the number of output partitions after `repartition` is determined only by the `repartition` logic. – Gabio Sep 24 '21 at 06:46
1

The value of 128 MB comes from the Spark property spark.sql.files.maxPartitionBytes, which applies only when you create a dataframe by reading a file-based source; see https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options for details. Its purpose is to achieve maximum parallelism while reading. So if you create a dataframe by transforming another dataframe or by joining two dataframes, its partitions are not affected by this value. For example, you can read 10 GB of data, do a df.repartition(1), and it will work without any issues (assuming your executor has enough memory).
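To make the read-time role of spark.sql.files.maxPartitionBytes concrete, here is a simplified back-of-the-envelope calculation in plain Python. Spark's real split planning also weighs spark.sql.files.openCostInBytes and the default parallelism, which this sketch deliberately ignores:

```python
import math

# Default of spark.sql.files.maxPartitionBytes (128 MB)
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def approx_read_partitions(total_input_bytes: int) -> int:
    """Rough count of partitions Spark creates when *reading* file-based input."""
    return math.ceil(total_input_bytes / MAX_PARTITION_BYTES)

ten_gb = 10 * 1024**3
print(approx_read_partitions(ten_gb))  # 80 read partitions of ~128 MB each

# After the read, repartition() alone decides the layout:
# df.repartition(1) would shuffle those ~80 partitions into a single 10 GB one.
```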

Vikas Saxena