I am using Spark v2.4 with PySpark.
spark.range(100).orderBy('id', ascending=False).rdd
When I type the above, it immediately spawns a Spark job. I find this surprising, as I didn't specify an action.
By contrast, spark.range(100).repartition(10, 'id').sortWithinPartitions('id').rdd
works as expected: no job is triggered.
A related question is "Why does sortBy transformation trigger a Spark job?"
It confirms that RDD sortBy
can trigger a job even though it is a transformation.
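As far as I understand it, RDD sortBy is eager because it builds a range partitioner, and a range partitioner has to sample the data to choose its boundaries before anything else can be planned. Below is a minimal pure-Python sketch of that idea, not Spark's actual implementation; the names range_bounds and range_partition and the sample size are my own assumptions for illustration.

```python
import bisect
import random


def range_bounds(partitions, num_ranges, sample_size=20, seed=0):
    """Pick num_ranges - 1 split points by sampling every input partition.

    The point of the sketch: the data must be read here, before any
    boundary is known, which is why this step cannot stay lazy.
    """
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        k = min(sample_size, len(part))
        sample.extend(rng.sample(part, k))
    sample.sort()
    step = len(sample) / num_ranges
    return [sample[int(step * i)] for i in range(1, num_ranges)]


def range_partition(partitions, bounds):
    """Route each element to the output range its value falls into."""
    out = [[] for _ in range(len(bounds) + 1)]
    for part in partitions:
        for x in part:
            out[bisect.bisect_right(bounds, x)].append(x)
    return out


# Hypothetical input: 100 ids spread over 4 unsorted partitions.
data = list(range(100))
random.Random(1).shuffle(data)
parts = [data[i::4] for i in range(4)]

bounds = range_bounds(parts, num_ranges=3)  # eager: reads the data
ranged = [sorted(p) for p in range_partition(parts, bounds)]
```

Concatenating the sorted ranges in order gives a globally sorted result, which is the part a purely per-partition sort cannot guarantee.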
But here I am using a DataFrame. spark.range(100).orderBy('id', ascending=False)
by itself does not start a job. The job is only triggered once I access .rdd.