
I have been learning Apache Spark recently. I installed Hadoop and Apache Spark via brew on my Hackintosh (El Capitan) with an i7-4790S CPU and 16 GB of RAM. I ran the SparkPi example as follows:

/usr/local/Cellar/apache-spark/1.6.1/bin/run-example SparkPi 1000

and it took 43 seconds to finish.

16/06/27 00:54:05 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 43.165503 s

I have another PC running Ubuntu 16.04 with an i3-4170T CPU and 16 GB of RAM. I set up a Docker container on it to run Hadoop and Spark (the same versions as on OS X). Interestingly, it took only 18 seconds to finish the same job.

16/06/28 16:22:49 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 18.264482 s

How come Spark on OS X, with a faster CPU, runs so much slower than Spark on Ubuntu?
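For a fairer comparison, both runs could be pinned to the same number of cores rather than letting Spark default to all of them (run-example defaults to local[*] unless MASTER is set, if I remember correctly). A rough sketch using spark-submit directly; the path to the examples jar under the Homebrew install is a guess:

# Force both environments to use two local cores (adjust the jar path to the actual install)
/usr/local/Cellar/apache-spark/1.6.1/bin/spark-submit \
  --master local[2] \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/Cellar/apache-spark/1.6.1/libexec/lib/spark-examples-*.jar 1000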

Joseph Hui
  • "I did the same thing in two completely different environments, and they performed differently." Well, yeah... – Marc B Jun 28 '16 at 16:55
  • And neither environment is one that Spark, a distributed computing tool, is actually designed to run in. – Jeff Jun 28 '16 at 17:07
  • I think you should share what configuration of resources, in terms of cores and memory, you have set for running this test. – Amit Kumar Jun 28 '16 at 17:36

1 Answer


For real Spark jobs, you'll often run into performance differences across platforms because of differences in which native libraries are available in the build you're using.
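One quick way to see whether that's a factor is to check which native libraries each environment actually picked up on the Hadoop side; the checknative tool reports this (assuming the hadoop command from each install is on the PATH):

# Reports whether the Hadoop native library and compression codecs (zlib, snappy, etc.) were loaded
hadoop checknative -a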

In your particular case, I'm skeptical that the SparkPi example depends on native libraries in any meaningful way; if you didn't run the job many times and average over the runs, the difference could very well just be chalked up to variance. That said, it's still conceivable that native libraries affect some expensive scheduling or filesystem-access operations in a way that shows up even for SparkPi.
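If you do want to rule out variance, a crude sketch is to repeat the run a handful of times and pull out the timing line from each log (the grep pattern just matches the DAGScheduler message you quoted):

# Repeat the example several times and collect the reported job times
for i in $(seq 1 10); do
  /usr/local/Cellar/apache-spark/1.6.1/bin/run-example SparkPi 1000 2>&1 | grep "Job 0 finished"
done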

Dennis Huo