
Spark v2.4 pyspark

spark.range(100).orderBy('id', ascending=False).rdd

When I type the above, it immediately spawns a Spark job. I find it surprising, as I didn't even specify an action.

E.g. `spark.range(100).repartition(10, 'id').sortWithinPartitions('id').rdd` works as expected, in that no job is triggered.

A related question is "Why does sortBy transformation trigger a Spark job?". It confirms that the RDD `sortBy` can trigger a job.

But here I am using a DataFrame. `spark.range(100).orderBy('id', ascending=False)` works alright on its own; the job only gets triggered once I access `.rdd`.

colinfang
    https://stackoverflow.com/questions/41403670/why-does-sortby-transformation-trigger-a-spark-job – vaquar khan Sep 17 '19 at 19:43

1 Answer


Not all transformations are 100% lazy. `orderBy` uses range partitioning, and to compute the partition boundaries Spark has to sample the underlying data, so it involves both a transformation and a job.
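To illustrate why a sort cannot be fully lazy, here is a minimal pure-Python sketch (not Spark's actual code) of the idea behind range partitioning: partition boundaries are determined by sampling the data, and that sampling pass is the extra job you see. All function names here are hypothetical.

```python
import random

def range_partition_bounds(data, num_partitions, sample_size=20):
    """Determine range-partition boundaries by sampling the data,
    roughly analogous to what Spark's RangePartitioner does.
    This first pass over the data is why a sort is not fully lazy."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # Pick num_partitions - 1 boundary values from the sorted sample.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def assign_partition(value, bounds):
    """Route a value to the partition whose range contains it.
    (Spark uses binary search for longer boundary lists.)"""
    for i, b in enumerate(bounds):
        if value < b:
            return i
    return len(bounds)

data = list(range(100))
bounds = range_partition_bounds(data, 4)   # requires a pass over the data
part = assign_partition(42, bounds)        # lazy per-record routing afterwards
```

Only the boundary computation needs to look at the data up front; assigning records to partitions afterwards is a plain per-record transformation.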

Yudovin Artsiom