
I am using Spark 2.4 and the %sql mode to query tables.

When using a window function on a large dataset, which of ORDER BY and SORT BY is more efficient from a query-performance standpoint?

I understand that ORDER BY ensures global ordering, but the computation gets pushed to a single reducer. SORT BY, on the other hand, sorts within each partition, but different partitions may receive overlapping ranges.

I want to understand whether SORT BY could also be used in this case, and which one will be more efficient when processing a large dataset (say 100 M rows)?

For example:

ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN

VS

ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN

Can anyone please help? Thanks.

dexter80
  • please take a look https://stackoverflow.com/questions/40603202/what-is-the-difference-between-sort-and-orderby-functions-in-spark – Natalia Aug 06 '20 at 08:18
  • Does this answer your question? [What is the difference between sort and orderBy functions in Spark](https://stackoverflow.com/questions/40603202/what-is-the-difference-between-sort-and-orderby-functions-in-spark) – Umair Khan Aug 06 '20 at 08:39
  • Wires crossed here. – thebluephantom Aug 06 '20 at 08:51

2 Answers

  • It does not matter whether you use SORT BY or ORDER BY. The single-reducer concern you are likely referring to is a Hive notion; you are using Spark, which has no such issue.

  • As for PARTITION BY: the single-reducer aspect is only an issue if you have nothing to partition by. You do have prsn_id, so it is not an issue.
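The PARTITION BY point can be illustrated with a toy plain-Python sketch (this is not Spark itself; the rows, dates, and the hypothetical 3-way task split are made up): rows are hash-distributed by prsn_id, so each task sorts and numbers only its own slice of keys, and no single global sort ever happens.

```python
# Plain-Python sketch (not Spark): with PARTITION BY prsn_id, rows are
# hash-distributed by key, so each task handles only its own keys.
rows = [
    ("p1", "2020-01-03"), ("p2", "2020-01-01"),
    ("p1", "2020-01-05"), ("p3", "2020-01-02"),
    ("p2", "2020-01-04"), ("p3", "2020-01-06"),
]

num_tasks = 3  # hypothetical parallelism
tasks = {i: [] for i in range(num_tasks)}
for prsn_id, purch_dt in rows:
    tasks[hash(prsn_id) % num_tasks].append((prsn_id, purch_dt))

# Each task numbers its own keys independently -- the "1 reducer"
# bottleneck only appears when there is no PARTITION BY at all.
result = {}
for task_rows in tasks.values():
    by_key = {}
    for prsn_id, purch_dt in task_rows:
        by_key.setdefault(prsn_id, []).append(purch_dt)
    for prsn_id, dates in by_key.items():
        # ISO dates sort chronologically as strings; reverse = DESC.
        for rn, dt in enumerate(sorted(dates, reverse=True), start=1):
            result[(prsn_id, dt)] = rn

print(result[("p1", "2020-01-05")])  # p1's most recent purchase -> RN 1
```

Whichever task a key hashes to, its row numbers come out the same, which is why the parallelism does not affect correctness.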

thebluephantom

SORT BY is applied within each bucket and does not guarantee that the entire dataset is sorted, whereas ORDER BY is applied to the entire dataset (in a single reducer). Since your query is partitioned and then sorted/ordered per partition key, both usages return the same output.
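A quick way to see this equivalence for ROW_NUMBER is a toy plain-Python sketch (not Spark; the rows here are made up, with integers standing in for purch_dt): compute the row numbers once from a single total ordering of the whole dataset (the ORDER BY analogue) and once by sorting each prsn_id group independently (the SORT BY analogue) — the results are identical, because numbering only ever looks inside one partition key.

```python
from itertools import groupby

rows = [("p1", 3), ("p2", 1), ("p1", 5), ("p2", 4), ("p1", 2)]

def rn_global_order(data):
    # ORDER BY analogue: one total sort of the whole dataset by
    # (partition key, purch_dt DESC), then number rows in that order.
    total = sorted(data, key=lambda r: (r[0], -r[1]))
    return {
        (key, dt): rn
        for key, grp in groupby(total, key=lambda r: r[0])
        for rn, (_, dt) in enumerate(grp, 1)
    }

def rn_sort_within_partition(data):
    # SORT BY analogue: shuffle rows by key, then sort each
    # partition's rows independently -- no global order anywhere.
    parts = {}
    for key, dt in data:
        parts.setdefault(key, []).append(dt)
    return {
        (key, dt): rn
        for key, dates in parts.items()
        for rn, dt in enumerate(sorted(dates, reverse=True), 1)
    }

assert rn_global_order(rows) == rn_sort_within_partition(rows)
```

The global sort does strictly more work than the per-partition sort, which is why the per-partition route can only be cheaper, never wrong, for this query.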