The following code returns 16 partitions. How is it possible to have 16 partitions for an array containing a single element?
rdd = sc.parallelize([""])
rdd.getNumPartitions()
The number of partitions in an RDD created by sc.parallelize depends on the scheduler implementation in use.
The SchedulerBackend trait has this method:
def defaultParallelism(): Int
The CoarseGrainedSchedulerBackend (which is used by YARN) has this implementation:
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
LocalSchedulerBackend has the following implementation:
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
That's why your RDD has 16 partitions: in local mode the fallback is totalCores, so your machine presumably exposes 16 cores to Spark.
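You can confirm this yourself. A minimal PySpark sketch, assuming an already-created SparkContext sc (e.g. from the pyspark shell):
# sc.defaultParallelism is the value parallelize() falls back to when no numSlices is given.
print(sc.defaultParallelism)       # e.g. 16 in local mode on a 16-core machine
rdd = sc.parallelize([""])         # no explicit partition count
print(rdd.getNumPartitions())      # matches sc.defaultParallelism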
In the case of the parallelize API it depends on the cluster manager.
In local mode it is the total number of cores on your machine.
In Mesos fine-grained mode it is 8.
In YARN it is the total number of cores on all executor nodes, or 2, whichever is higher.
These are the defaults that apply only when you don't provide the number of partitions explicitly; a sketch of overriding them follows below.
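To pick the partition count yourself, you can either set spark.default.parallelism before creating the context or pass numSlices to parallelize. A minimal sketch, assuming no SparkContext exists yet (the local[4] master string is just an illustrative choice):
from pyspark import SparkConf, SparkContext

# Set the default before the context is created; it is read at context creation time.
conf = SparkConf().setMaster("local[4]").set("spark.default.parallelism", "4")
sc = SparkContext(conf=conf)

print(sc.parallelize([""]).getNumPartitions())     # 4, taken from spark.default.parallelism
print(sc.parallelize([""], 2).getNumPartitions())  # 2, an explicit numSlices wins over the default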
Yes, your RDD will have 16 partitions, but 15 of them will be empty. You can check this e.g. with rdd.mapPartitions
(see Apache Spark: Get number of records per partition). The number 16 comes from spark.default.parallelism
in your case and depends on your environment, not on the size of your data.
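For reference, a minimal sketch of that check, assuming the rdd from the question already exists:
# mapPartitions yields one record count per partition.
counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(counts)       # e.g. 15 zeros and a single 1 for the partition holding the one element
print(len(counts))  # 16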
In general empty partitions do not hurt; they finish very fast. You could also repartition or coalesce to 1 partition if you don't like empty partitions (see e.g. Dropping empty DataFrame partitions in Apache Spark), but I would not recommend that.
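If you do want to collapse the RDD anyway, a minimal sketch, again assuming the rdd from the question:
# coalesce avoids a full shuffle when reducing the partition count.
single = rdd.coalesce(1)
print(single.getNumPartitions())   # 1
# rdd.repartition(1) would give the same count but triggers a shuffle.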