
I'm fairly new to Spark, but I want to do the simplest possible experiment that will show the difference between single- and multi-core processing times.

I have the following function:

def timeit(a: Seq[Int], iters: Int): Unit = {
  val now = System.nanoTime
  for (i <- 1 to iters) {
    sc.parallelize(a).map((_, 1)).reduceByKey(_ + _)
  }
  println(System.nanoTime - now)
}
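As a plain-Scala sanity check (no Spark involved) of what the `map((_, 1)).reduceByKey(_ + _)` pipeline computes, here is a hypothetical `countByValue` helper for illustration:

```scala
// Plain-Scala equivalent of .map((_, 1)).reduceByKey(_ + _):
// count the occurrences of each distinct value in the sequence.
def countByValue(a: Seq[Int]): Map[Int, Int] =
  a.groupBy(identity).map { case (k, vs) => (k, vs.length) }

val counts = countByValue(Seq(1, 2, 2, 3, 3, 3))
println(counts) // Map(1 -> 1, 2 -> 2, 3 -> 3)
```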

And I feed it the following random sequence A:

val rgen = new scala.util.Random(0)
val A = (1 to 10000000).map(_ => rgen.nextInt(10))

When I first start the shell with

./bin/spark-shell --master local[1]

and run timeit(A, 100) three times, the results are:

21373321483
21725216009
23148133684

I then start the shell with

./bin/spark-shell --master local[*]

(I have an 8-core Intel i7) and run timeit(A, 100) three times again, getting:

20127289893
20580838235
20371447280
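For readability, these totals are in nanoseconds, so both runs work out to roughly 20-23 seconds for 100 iterations:

```scala
// Convert one of the nanosecond totals above to seconds
// and a per-iteration average over the 100 iterations.
val totalNs = 20127289893L
val seconds = totalNs / 1e9
val perIter = seconds / 100
println(f"$seconds%.1f s total, $perIter%.3f s per iteration")
```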

I understand that I shouldn't expect an 8x speedup, but I would think the difference would be a lot more pronounced than this.

Any ideas why this might be happening? I have done quite a bit of searching, and I can't find anyone else who has done a simple experiment like this. Thank you.

EDIT:

Upon further exploration, I have found that a key to improving performance is to change the number of partitions in the call to sc.parallelize inside the function. For instance, I changed the function to this:

def timeit(a: Seq[Int], iters: Int, slices: Int): Unit = {
  val now = System.nanoTime
  for (i <- 1 to iters) {
    sc.parallelize(a, numSlices = slices).map((_, 1)).reduceByKey(_ + _).collect()
  }
  println(System.nanoTime - now)
}
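As an aside, the nanosecond bookkeeping can be factored into a small plain-Scala timing helper (hypothetical, not part of Spark) that returns elapsed seconds instead of printing raw nanoseconds:

```scala
// Run a block, returning its result and the elapsed wall-clock seconds.
def time[T](body: => T): (T, Double) = {
  val start = System.nanoTime
  val result = body
  (result, (System.nanoTime - start) / 1e9)
}

val (total, secs) = time { (1 to 1000000).sum }
println(f"sum=$total, took $secs%.4f s")
```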

and, after starting the shell with 8 cores (./bin/spark-shell --master local[*]), I called timeit(A, 100, n) four times for each n from 1 through 10. The following averages resulted:

[Chart: Runtime vs number of RDD partitions]

I also found that if you do the same thing with just 1 core (./bin/spark-shell --master local[1]), changing the number of partitions in the RDD has no effect on run time, and the time is approximately the same as the run time for 8 cores with 1 partition.

I'm aware that leaving out the optional numSlices parameter in sc.parallelize lets Spark pick a default number of partitions, but at this point I'm not sure how that default is chosen. Regardless, the above is more or less what I wanted to show.
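My understanding (unverified against the Spark source) is that for a local collection, parallelize splits the sequence into roughly equal contiguous slices. A plain-Scala sketch of what I'd expect that partitioning to look like:

```scala
// Split a sequence into `numSlices` roughly equal contiguous slices,
// approximating how sc.parallelize partitions a local Seq (assumption).
def slices[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val n = seq.length
  (0 until numSlices).map { i =>
    seq.slice(i * n / numSlices, (i + 1) * n / numSlices)
  }
}

val parts = slices(1 to 10, 3)
println(parts.map(_.length)) // Vector(3, 3, 4)
```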

  • Thanks @zero323. I suppose the question you reference is close enough, even though I think of counting as closer to "Embarrassingly parallel" than sorting. Also the numbers I get don't seem to follow Amdahl's law. – tarski Apr 21 '17 at 21:26
  • It was more a suggestion. If you want I can reopen this. – zero323 Apr 21 '17 at 21:50
  • @zero323, see above edit. I'll let you decide. The comments in the question that you reference are certainly valid, but I think the above observation is important. – tarski Apr 22 '17 at 03:22

0 Answers