
I hope this question does not attract downvotes. The answer seems to vary across Spark versions, so let me ask it anyway. Note that this question is purely about performance, not about developer productivity or skill. I am new to Spark, and I suspect many would like to know the current status as of 2017.

I am aware that CPython lacks a JIT, but that is not the question here; this is purely about PySpark itself.


I still fail to understand why PySpark is reportedly slow compared to using the Spark API directly from Scala (or whether that claim is even accurate). From my reading, the performance impact depends on which API is used.

For RDDs: fundamentally, data from the Spark (JVM) worker is serialized and sent to a Python worker. The double serialization in some operations makes them expensive (it of course depends on the staged pipeline and the operations involved, but a shuffle forces the Python process to communicate with the JVM worker again, and hence another round of serialization). This talk sheds light on it. A minimal sketch of where that boundary crossing happens is below (placeholder names, assuming a local SparkSession).
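```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-boundary").getOrCreate()
sc = spark.sparkContext

# Each element touched by the Python lambda is serialized out of the JVM,
# deserialized in a Python worker process, transformed, and serialized back.
doubled = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)

# A shuffle (reduceByKey) pushes the data through the Python <-> JVM
# boundary again before it can be repartitioned.
sums = doubled.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)
print(sums.collect())
```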

But things look different with the Dataset/DataFrame API, which reportedly performs the same from all languages (source). For comparison, here is a sketch (same assumed setup as above) of a query built only from built-in expressions; as I understand it, only the query plan is assembled in Python, and execution happens in the JVM.
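```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("dataframe-builtins").getOrCreate()

# Only the query plan is built in Python; the filtering and aggregation run
# inside the JVM via Catalyst/Tungsten, so no rows cross the Python <-> JVM
# boundary until results are collected or shown.
df = spark.range(1_000_000)

result = (df
          .withColumn("bucket", F.col("id") % 10)
          .groupBy("bucket")
          .agg(F.count("*").alias("cnt"), F.avg("id").alias("avg_id")))

result.show()
```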

My questions are:

  • Is my understanding above correct? Can someone shed more light on when PySpark is actually slow? Or is the slowness attributable only to the lack of a JIT rather than to any PySpark intricacies?
  • What practical problems are faced with PySpark when RDDs are used?
Jatin

1 Answer


If you use only built-in functions on the DataFrame API, then the overhead of Python should be very low (just the API wrapping). If, however, you use a UDF or anything which maps to an RDD (e.g. map), then PySpark will be much slower.

The reasons for it being slower are well explained in the video you shared.
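To make the distinction concrete, here is a rough sketch of the same transformation written both ways (names such as `py_upper` are just placeholders): the built-in expression stays in the JVM, while the Python UDF ships every row to a Python worker and back.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").appName("udf-vs-builtin").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Fast path: built-in expression, executed entirely inside the JVM.
builtin_upper = df.select(F.upper(F.col("name")).alias("upper_name"))

# Slow path: a Python UDF serializes each row to a Python worker, runs the
# function there, and serializes the result back to the JVM.
@F.udf(returnType=StringType())
def py_upper(s):
    return s.upper() if s is not None else None

udf_upper = df.select(py_upper(F.col("name")).alias("upper_name"))

builtin_upper.show()
udf_upper.show()
```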

Assaf Mendelson