Clarity on number of partitions in spark dataframe

Question

from pyspark.sql.types import *

schema = StructType([StructField("type", StringType(), True), StructField("average", IntegerType(), True)])
values = [('A', 19), ('B', 17), ('C', 10)]
df = spark.createDataFrame(values, schema)

parts = df.rdd.getNumPartitions()

print(parts)

Output is 44

How is spark creating 44 partitions for 3 records dataframe?

import pyspark.sql.functions as F
df.withColumn('p_id', F.spark_partition_id()).show()

Output :

+----+-------+----+
|type|average|p_id|
+----+-------+----+
|   A|     19|  14|
|   B|     17|  29|
|   C|     10|  43|
+----+-------+----+

score 1 · Answer 1 · answered Sep 23 '19 at 11:56

When Dataset/Dataframe is created out of a collection it does take rows number into account. Eventually it comes down to LocalTableScanExec, look here

numParallelism: Int = math.min(math.max(unsafeRows.length, 1),  sqlContext.sparkContext.defaultParallelism)
rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

Where unsafeRows.length equals to the provided collection size.

Also, look at this answer for several related settings.

score 0 · Answer 2 · answered Sep 23 '19 at 11:23

Cause Spark initially create N number of partitions regardless of data. For example, I've run Spark locally with "local[4]" and create a DF from 2 rows df.rdd().getNumPartitions() would return 4, cause there is 4 cores for Spark job.

If I do the next:

df.repartition(2).rdd().getNumPartitions()

result would be 2.

Clarity on number of partitions in spark dataframe

2 Answers2