
I keep hearing a common theme that serious programming on Spark (1.5.1) should only be done in Scala, that real power users use Scala, and that Python is great for analytics but in the end the code should be rewritten in Scala to finalise it. There are a number of reasons given:

  1. Spark is written in Scala, so it will always be faster than any other language implementation on top of it.
  2. Spark releases always favour the Scala API, with more features visible and enabled there than in the Python API.

Is there any truth to the above? I'm a little sceptical.

Thanks

Dan
  • I don't understand your question. It should be obvious that the only language even worth considering on Spark is Clojure. It is no Haskell, but we all have to compromise, don't we? Not to mention any kind of programming other than serious should be forbidden :) Seriously though, I am voting to close this question. 1. Going outside the JVM requires some overhead. Does it mean your program will be slower? Maybe. It depends on the context. 2. Yes, new features come first to the Scala API. Some may never be introduced in Python due to internal limitations. Beyond that there is no good answer here. – zero323 Oct 07 '15 at 06:29
  • I don't understand why you want to close this. I'm enquiring into the view that Scala gets additional features before PySpark. Is there evidence that the Scala API in Spark has this as a policy? – Dan Oct 07 '15 at 08:38
  • The evidence is simple - the Spark source. It is definitely not a policy, but every part of the PySpark API requires either a wrapper around the Scala API or a separate implementation on top of the existing Python API. Regarding internal limitations, here is one example: http://stackoverflow.com/q/31684842/1560062. Why vote to close? Because in my opinion it is too close to Scala vs Python, which is a completely pointless discussion. – zero323 Oct 07 '15 at 09:10

1 Answer


The Spark DataFrame API performs the same whether you're running it in Scala, PySpark, or Java. However, the RDD API runs much faster in Scala than in PySpark.
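
As a rough illustration (a minimal sketch using the Spark 1.5-era SQLContext API and made-up data, not code from either post), the same aggregation looks like this in the two APIs:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="df-vs-rdd")
sqlContext = SQLContext(sc)

# DataFrame API: the aggregation is planned by Catalyst and executed in the JVM,
# so PySpark, Scala, and Java all end up running essentially the same physical plan.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

# RDD API: the lambda below runs in Python worker processes, so every record is
# serialised between the JVM and Python - this is where the Scala version pulls ahead.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda a, b: a + b).collect())
```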

Databricks have a very good post on some recent performance improvements in Spark.

The Scala API definitely gets more testing, and more new functionality first, though it's not always the case that a new feature is only available in Scala or Java.

Personally, I would say the effort required to learn enough Scala to get by is worth it - you don't need to be a Scala expert to get the benefits of working with it in Spark.

Ewan Leith
  • _DataFrame API performs the same whether you're running it in Scala, PySpark_ - that is true only when you don't use UDFs and UDTs. – zero323 Oct 07 '15 at 09:03
  • True, any code you write in Python that's not calling the DataFrame API is going to be as slow as normal Python, whether it's UDFs + UDTs or just string manipulation, etc. (see the sketch after these comments). – Ewan Leith Oct 07 '15 at 09:08
  • It doesn't mean it will be slower than Scala (http://stackoverflow.com/a/32471016/1560062), but using PySpark adds another layer of complexity which is usually well hidden, but it can bite you when you least expect it :) – zero323 Oct 07 '15 at 09:18
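
To make the UDF point from the comments concrete, here is a minimal PySpark sketch (hypothetical data and column names, Spark 1.5-era API): the built-in expression stays in the JVM, while the Python UDF pushes every row through a Python worker.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

sc = SparkContext(appName="udf-overhead")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("alice",), ("bob",)], ["name"])

# Built-in expression: evaluated entirely inside the JVM,
# so it runs at the same speed from any language.
df.select(upper(df["name"])).show()

# Python UDF: every row is shipped to a Python worker and back, so this column
# is processed at ordinary Python speed regardless of how fast the rest of the plan is.
to_upper = udf(lambda s: s.upper(), StringType())
df.select(to_upper(df["name"])).show()
```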