
I wrote the following code for Spark:

val longList = (1 to 1000000).toList          // one million Ints
val numsToAdd = sc.parallelize(longList, 30)
val st = System.nanoTime()
println(numsToAdd.reduce((x, y) => x + y))
val et = System.nanoTime()
println("Time spent: " + (et - st))

I ran the parallelize transformation with various values for the number of partitions. The time spent was as follows:

Partitions     Time spent (ns)
default        97916961
10             111907094
20             141691820
30             158264230
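
For reference, here is a minimal sketch (assuming the same SparkContext sc as above) of how all four timings could be gathered in a single run:

val longList = (1 to 1000000).toList
for (parts <- Seq(sc.defaultParallelism, 10, 20, 30)) {
  val rdd = sc.parallelize(longList, parts)
  val st = System.nanoTime()
  rdd.reduce((x, y) => x + y)               // same reduction as above
  val et = System.nanoTime()
  println(s"partitions=$parts time=${et - st} ns")
}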

My understanding is that with the partitions parameter, Spark divides the dataset into that many parts and performs the operation on each part in parallel. So intuitively, the more partitions, the faster the program should run.

But it seems to be the opposite.

I run my program on my local machine.

Do partitions work some other way?

Your dataset is much too small to really utilize Spark's benefits. As long as the data fits in memory, don't use Spark; use the plain Scala collection API. – Raphael Roth Jan 06 '18 at 11:05
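
To illustrate the comment above: when the data fits in memory, the plain Scala collection API does the same work without any task-scheduling overhead. A minimal sketch, reusing longList from the question (widened to Long so the one-million-element sum does not overflow Int):

val localSum = longList.map(_.toLong).sum   // plain Scala, no Spark, no partitions
println("Local sum: " + localSum)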

1 Answer


In general, more (and therefore smaller) partitions allow work to be distributed among more workers, while fewer (and therefore larger) partitions let the work be done in bigger chunks; the latter can finish faster, thanks to reduced overhead, as long as all workers are kept busy.

Increasing the partition count makes each partition hold less data (or possibly no data at all!).

Too few partitions: you will not utilize all of the cores available in the cluster.

Too many partitions: there will be excessive overhead in managing many small tasks.

According to the Spark docs, the ideal value is 2-3 times the number of cores.
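
For example, a minimal sketch of applying that rule of thumb (sc.defaultParallelism typically equals the number of cores when running on a local[*] master):

val cores = sc.defaultParallelism           // ~ number of cores in local[*] mode
val numPartitions = cores * 2               // 2-3x the core count, per the guideline
val rdd = sc.parallelize(1 to 1000000, numPartitions)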

Spark RDD Partition

Spark Repartition

undefined_variable