I'm fairly new to Spark, but I want to do the simplest possible experiment that will show the difference between single- and multi-core processing times.
I have the following function
def timeit(a: Seq[Int], iters: Int): Unit = {
  val now = System.nanoTime
  for (i <- 1 to iters) {
    sc.parallelize(a).map((_, 1)).reduceByKey(_ + _)
  }
  println(System.nanoTime - now)
}
And I feed it the following random sequence A
val rgen = new scala.util.Random(0)
val A = (1 to 10000000).map(_ => rgen.nextInt(10))
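As a quick sanity check (plain Scala, no Spark needed), the sequence can be verified to contain ten million values in 0 through 9:

```scala
// Sanity-check the generated data: 10,000,000 values, each in 0..9.
assert(A.length == 10000000)
assert(A.forall(x => x >= 0 && x < 10))
```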
When I first start the shell with
./bin/spark-shell --master local[1]
and run timeit(A, 100)
three times, the results (in nanoseconds) are
21373321483
21725216009
23148133684
I then start the shell with
./bin/spark-shell --master local[*]
(I have an 8-core Intel i7) and run timeit(A, 100)
three times again, getting
20127289893
20580838235
20371447280
I understand that I shouldn't expect an 8x speedup, but I would think the difference would be a lot more pronounced than this.
Any ideas why this might be happening? I have done quite a bit of searching, and I can't find anyone else who has done a simple experiment like this. Thank you.
EDIT:
Upon further exploration, I have found that a key to improving performance is to change the number of partitions in the call to sc.parallelize. For instance, I changed the function to the following (note that this version also adds a .collect() call, which forces the job to actually execute, since the transformations alone are lazily evaluated):
def timeit(a: Seq[Int], iters: Int, slices: Int): Unit = {
  val now = System.nanoTime
  for (i <- 1 to iters) {
    sc.parallelize(a, numSlices = slices).map((_, 1)).reduceByKey(_ + _).collect()
  }
  println(System.nanoTime - now)
}
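The partition sweep described below can be driven by a simple loop; a sketch assuming the timeit function above and the sequence A are already defined in the spark-shell session:

```scala
// Run the benchmark for each partition count from 1 to 10,
// repeating each configuration four times.
// Assumes `timeit` and `A` are defined in the current spark-shell session.
for (n <- 1 to 10; trial <- 1 to 4) {
  println(s"partitions = $n, trial = $trial")
  timeit(A, 100, n)
}
```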
and, after starting the shell with 8 cores (./bin/spark-shell --master local[*]), called timeit(A, 100, n) four times for each n in the range 1 through 10. The following averages resulted:
I also found that if you do the same thing with just 1 core (./bin/spark-shell --master local[1]), changing the number of partitions in the RDD has no effect on run time, and the time is approximately the same as the run time for 8 cores with 1 partition.
I'm aware that leaving out the optional numSlices parameter in sc.parallelize lets Spark pick a default number of partitions, but at this point I'm not sure how that default is chosen. Regardless, the above is more or less what I wanted to show.
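For what it's worth, the default that parallelize uses when numSlices is omitted can be inspected directly in the shell; it falls back to sc.defaultParallelism, which in local[N] mode is N (so the core count under local[*]). A sketch assuming a running spark-shell session with A defined:

```scala
// Inspect the default partition count used when numSlices is omitted.
// In local[N] mode, defaultParallelism is N; under local[*] it is the core count.
println(sc.defaultParallelism)
println(sc.parallelize(A).getNumPartitions) // matches defaultParallelism
```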