The following code returns 16 partitions. How is it possible to have 16 partitions for an array containing a single element?
rdd = sc.parallelize([""])
rdd.getNumPartitions()
The number of partitions in an RDD created by sc.parallelize depends on the scheduler implementation in use.
The SchedulerBackend trait has this method:
def defaultParallelism(): Int
The CoarseGrainedSchedulerBackend (which is used by YARN) has this implementation:
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
LocalSchedulerBackend has the following implementation:
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
That's why your RDD has 16 partitions: in local mode the fallback is totalCores, so your machine presumably exposes 16 cores to Spark.
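You can confirm this yourself. A minimal PySpark sketch, assuming an already-created SparkContext sc (e.g. from the pyspark shell):
# sc.defaultParallelism is the value parallelize() falls back to when no numSlices is given.
print(sc.defaultParallelism)       # e.g. 16 in local mode on a 16-core machine
rdd = sc.parallelize([""])         # no explicit partition count
print(rdd.getNumPartitions())      # matches sc.defaultParallelism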
In the case of the parallelize API it depends on the cluster manager.
In local mode it is the total number of cores on your machine.
In Mesos fine-grained mode it is 8.
In YARN it is the total number of cores on all executor nodes, or 2, whichever is higher.
These are the defaults that apply only when you don't provide the number of partitions explicitly; a sketch of overriding them follows below.
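To pick the partition count yourself, you can either set spark.default.parallelism before creating the context or pass numSlices to parallelize. A minimal sketch, assuming no SparkContext exists yet (the local[4] master string is just an illustrative choice):
from pyspark import SparkConf, SparkContext

# Set the default before the context is created; it is read at context creation time.
conf = SparkConf().setMaster("local[4]").set("spark.default.parallelism", "4")
sc = SparkContext(conf=conf)

print(sc.parallelize([""]).getNumPartitions())     # 4, taken from spark.default.parallelism
print(sc.parallelize([""], 2).getNumPartitions())  # 2, an explicit numSlices wins over the default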
Yes, your RDD will have 16 partitions, but 15 of them will be empty. You can check this e.g. with rdd.mapPartitions
(see Apache Spark: Get number of records per partition). The number 16 comes from spark.default.parallelism
in your case and depends on your environment, not on the size of your data.
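For reference, a minimal sketch of that check, assuming the rdd from the question already exists:
# mapPartitions yields one record count per partition.
counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(counts)       # e.g. 15 zeros and a single 1 for the partition holding the one element
print(len(counts))  # 16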
In general empty partitions do not hurt; they finish very fast. You could also repartition or coalesce to 1 partition if you don't like empty partitions (see e.g. Dropping empty DataFrame partitions in Apache Spark), but I would not recommend that.
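If you do want to collapse the RDD anyway, a minimal sketch, again assuming the rdd from the question:
# coalesce avoids a full shuffle when reducing the partition count.
single = rdd.coalesce(1)
print(single.getNumPartitions())   # 1
# rdd.repartition(1) would give the same count but triggers a shuffle.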