I am using Spark 2.4 and querying tables in %sql mode.
If I am using a window function on a large dataset, which of ORDER BY and SORT BY will be more efficient from a query-performance standpoint?
I understand that ORDER BY guarantees global ordering, but the computation gets pushed to a single reducer. SORT BY, on the other hand, sorts within each partition, but different partitions may receive overlapping ranges of values.
I want to understand whether SORT BY could also be used in this case, and which one would be more efficient when processing a large dataset (say, 100 million rows).
For example:
ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN
VS
ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN
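For reference, here is a pure-Python sketch of what ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt DESC) computes. The data is hypothetical and no Spark is involved; this only illustrates that the ordering inside the OVER clause is applied separately within each prsn_id partition:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (prsn_id, purch_dt) rows -- stand-ins for the real table.
rows = [
    (1, "2019-03-01"),
    (1, "2019-01-15"),
    (2, "2019-02-20"),
    (2, "2019-04-05"),
]

def row_number_per_partition(rows):
    """Number rows 1..n within each prsn_id, latest purch_dt first."""
    out = []
    keyed = sorted(rows, key=itemgetter(0))  # group rows by prsn_id
    for prsn_id, grp in groupby(keyed, key=itemgetter(0)):
        # The sort here is purely local to one partition: ordering one
        # person's rows never requires sorting the whole dataset.
        ordered = sorted(grp, key=itemgetter(1), reverse=True)
        out.extend((pid, dt, rn) for rn, (pid, dt) in enumerate(ordered, 1))
    return out

for pid, dt, rn in row_number_per_partition(rows):
    print(pid, dt, rn)
```

The point of the sketch is that the window's ordering is per-partition by construction, which is what motivates my question about whether SORT BY is sufficient here.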
Can anyone please help? Thanks.