
I am using Spark 2.4 and the %sql mode to query tables.

When using a window function on a large dataset, which of ORDER BY and SORT BY is more efficient from a query-performance standpoint?

I understand that ORDER BY ensures global ordering, but the computation gets pushed to a single reducer. SORT BY, on the other hand, sorts within each partition, but different partitions may receive overlapping ranges.

I want to understand whether SORT BY could also be used in this case, and which one will be more efficient when processing a large dataset (say 100 M rows)?

For example:

ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN

VS

ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN

Can anyone please help? Thanks.

dexter80
  • please take a look https://stackoverflow.com/questions/40603202/what-is-the-difference-between-sort-and-orderby-functions-in-spark – Natalia Aug 06 '20 at 08:18
  • Does this answer your question? [What is the difference between sort and orderBy functions in Spark](https://stackoverflow.com/questions/40603202/what-is-the-difference-between-sort-and-orderby-functions-in-spark) – Umair Khan Aug 06 '20 at 08:39
  • Wires crossed here. – thebluephantom Aug 06 '20 at 08:51

2 Answers

  • It does not matter whether you use SORT BY or ORDER BY. The single-reducer concern you are likely referring to is a Hive notion; you are using Spark, which has no such issue.

  • As for PARTITION BY: the single-reducer aspect is only an issue if you have nothing to partition by. You do have prsn_id, so it is not an issue.
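The PARTITION BY point can be illustrated with a toy plain-Python sketch (this is not Spark itself; the rows, dates, and the hypothetical 3-way task split are made up): rows are hash-distributed by prsn_id, so each task sorts and numbers only its own slice of keys, and no single global sort ever happens.

```python
# Plain-Python sketch (not Spark): with PARTITION BY prsn_id, rows are
# hash-distributed by key, so each task handles only its own keys.
rows = [
    ("p1", "2020-01-03"), ("p2", "2020-01-01"),
    ("p1", "2020-01-05"), ("p3", "2020-01-02"),
    ("p2", "2020-01-04"), ("p3", "2020-01-06"),
]

num_tasks = 3  # hypothetical parallelism
tasks = {i: [] for i in range(num_tasks)}
for prsn_id, purch_dt in rows:
    tasks[hash(prsn_id) % num_tasks].append((prsn_id, purch_dt))

# Each task numbers its own keys independently -- the "1 reducer"
# bottleneck only appears when there is no PARTITION BY at all.
result = {}
for task_rows in tasks.values():
    by_key = {}
    for prsn_id, purch_dt in task_rows:
        by_key.setdefault(prsn_id, []).append(purch_dt)
    for prsn_id, dates in by_key.items():
        # ISO dates sort chronologically as strings; reverse = DESC.
        for rn, dt in enumerate(sorted(dates, reverse=True), start=1):
            result[(prsn_id, dt)] = rn

print(result[("p1", "2020-01-05")])  # p1's most recent purchase -> RN 1
```

Whichever task a key hashes to, its row numbers come out the same, which is why the parallelism does not affect correctness.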

thebluephantom

SORT BY is applied within each bucket and does not guarantee that the entire dataset is sorted, whereas ORDER BY is applied to the entire dataset (in a single reducer). Since your query is partitioned and then sorted/ordered per partition key, both usages return the same output.
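A quick way to see this equivalence for ROW_NUMBER is a toy plain-Python sketch (not Spark; the rows here are made up, with integers standing in for purch_dt): compute the row numbers once from a single total ordering of the whole dataset (the ORDER BY analogue) and once by sorting each prsn_id group independently (the SORT BY analogue) — the results are identical, because numbering only ever looks inside one partition key.

```python
from itertools import groupby

rows = [("p1", 3), ("p2", 1), ("p1", 5), ("p2", 4), ("p1", 2)]

def rn_global_order(data):
    # ORDER BY analogue: one total sort of the whole dataset by
    # (partition key, purch_dt DESC), then number rows in that order.
    total = sorted(data, key=lambda r: (r[0], -r[1]))
    return {
        (key, dt): rn
        for key, grp in groupby(total, key=lambda r: r[0])
        for rn, (_, dt) in enumerate(grp, 1)
    }

def rn_sort_within_partition(data):
    # SORT BY analogue: shuffle rows by key, then sort each
    # partition's rows independently -- no global order anywhere.
    parts = {}
    for key, dt in data:
        parts.setdefault(key, []).append(dt)
    return {
        (key, dt): rn
        for key, dates in parts.items()
        for rn, dt in enumerate(sorted(dates, reverse=True), 1)
    }

assert rn_global_order(rows) == rn_sort_within_partition(rows)
```

The global sort does strictly more work than the per-partition sort, which is why the per-partition route can only be cheaper, never wrong, for this query.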