I wrote the following code for Spark:
// Build a list of one million integers and sum it on the cluster
val longList = (1 to 1000000).toList
val numsToAdd = sc.parallelize(longList, 30)
val st = System.nanoTime()
println(numsToAdd.reduce((x, y) => x + y))
val et = System.nanoTime()
println("Time spent: " + (et - st))
I ran the parallelize step with various values for the number of partitions. The time spent (in nanoseconds) was:
Partitions   Time spent (ns)
default      97916961
10           111907094
20           141691820
30           158264230
My understanding is that with the partitions parameter, Spark divides the dataset into that many parts and performs the operation on each part in parallel. So intuitively, the more partitions there are, the faster the program should be.
But it seems to be the opposite.
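To check that the data really is split as described, a small sketch like the following can be used (assuming a SparkContext `sc`, e.g. in spark-shell; `glom()` gathers each partition into an array so the split is visible):

```scala
// Parallelize a small range into 3 partitions and inspect the split.
val rdd = sc.parallelize(1 to 10, 3)
println(rdd.getNumPartitions) // prints 3

// glom() turns each partition into an Array, so collect() returns
// one array per partition, showing which elements landed where.
rdd.glom().collect().foreach(p => println(p.mkString(",")))
```
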
I run my program on my local machine.
Do partitions work some other way?