
I've always read that Scala is much faster than PySpark for many operations, but recently I read in a blog post that since the release of Spark 2 the performance differences are much smaller.

Is this perhaps due to the introduction of DataFrames? Does that mean that operations on DataFrames take the same time in Scala and PySpark?

Is there a detailed and recent performance report on the Scala/PySpark differences?

user1403546
  • I'm not aware of a recent benchmark study of the performance differences between Scala Spark and PySpark. Nevertheless, the DataFrame and Dataset APIs are built on top of the Spark SQL engine, which uses Catalyst to generate an optimized logical and physical query plan. Whether you use the R, Java, Scala, or Python DataFrame/Dataset API, all relational queries go through the same optimizer, giving the same space and speed efficiency (see the first sketch after these comments). – eliasah Oct 27 '17 at 13:32
  • But the problem isn't there. Apache Spark hasn't gained any performance on RDDs since the last benchmark was done, and sometimes you'll need to fall back to RDDs when you need more control. Scala outperforms Python there (see the second sketch after these comments). – eliasah Oct 27 '17 at 13:33
  • The point I discussed in my earlier comments is just one part of what can differ. Unfortunately, this question remains off-topic for being too broad, and I'm voting to close it! – eliasah Oct 27 '17 at 13:36
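To make the first comment concrete, here is a minimal PySpark sketch (the app name, toy data, and column names are my own illustrative choices, not from the thread). It shows that a DataFrame query is compiled by Catalyst into an optimized plan; the equivalent Scala DataFrame query compiles to the same plan, which is why DataFrame performance is largely language-independent.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Toy data, purely illustrative.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# A relational query: filter, group, aggregate.
result = (
    df.filter(F.col("value") > 1)
      .groupBy("key")
      .agg(F.sum("value").alias("total"))
)

# explain(True) prints the parsed, analyzed, optimized, and physical plans.
# An equivalent Scala DataFrame query produces the same optimized plan,
# since both go through the same Catalyst optimizer.
result.explain(True)
```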
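And to illustrate the second comment, a rough sketch of where the Python penalty on RDDs comes from (dataset size and names are arbitrary assumptions): a Python lambda on an RDD runs in Python worker processes, forcing per-row serialization between the JVM and Python, while the equivalent DataFrame expression executes entirely inside the JVM.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

df = spark.range(1000000)  # one LongType column named "id"

# RDD path: the Python lambda runs in Python worker processes, so every
# row is shipped between the JVM and Python.
rdd_sum = df.rdd.map(lambda row: row.id * 2).sum()

# DataFrame path: the same arithmetic is a Catalyst expression and
# stays inside the JVM.
df_sum = df.select(F.sum(F.col("id") * 2).alias("s")).first()["s"]

print(rdd_sum, df_sum)  # same result, very different execution paths
```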

0 Answers