What is the difference between sort and orderBy functions in Spark

Question

What is the difference between the sort and orderBy functions on a Spark DataFrame?

scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)

The commands below produce the same result:

zips.sort(desc("pop")).show
zips.orderBy(desc("pop")).show

score 38 · Accepted Answer · answered Nov 15 '16 at 06:20

orderBy is just an alias for the sort function.

From the Spark documentation:

  /**
   * Returns a new Dataset sorted by the given expressions.
   * This is an alias of the `sort` function.
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)

answered Nov 15 '16 at 06:20

Shivansh


From the Spark documentation, it seems that SORT BY and ORDER BY are not the same. https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html Am I missing something? – Fardin Abdi Apr 02 '21 at 21:09

We have to classify these properly to understand them clearly. The clauses in Spark SQL: ORDER BY does a total ordering; SORT BY does a partition-wise ordering. The functions in the Spark DataFrame API: sort() and orderBy() do a total ordering; sortWithinPartitions() does a partition-wise ordering. – Ankit Mahajan Apr 11 '21 at 06:08

But in PySpark, I can see that orderBy is just an alias of the sort function: https://github.com/apache/spark/blob/0c9c8ff56933e6ae13454845e831746360af84e3/python/pyspark/sql/dataframe.py#L1423 – Bharath Ram Jul 11 '21 at 13:42

Even in Scala, orderBy is an alias of the sort function: https://github.com/apache/spark/blob/5d74ace648422e7a9bff7774ac266372934023b9/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1306 – Bharath Ram Jul 11 '21 at 14:34

score 19 · Answer 2 · edited Dec 10 '20 at 14:41

They are NOT the SAME.

The SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered.

Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html

The ORDER BY clause is used to return the result rows in a sorted manner in the user specified order. Unlike the SORT BY clause, this clause guarantees a total order in the output.

Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html
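
To make the quoted distinction concrete, here is a minimal pure-Python sketch. It is not Spark code: each inner list simply stands in for one partition, and the helper names sort_by and order_by, as well as the sample (Id, Salary) rows, are assumptions made up for illustration. It only mimics the behaviour the two SQL clauses are documented to have.

```python
# Pure-Python sketch (no Spark required).
# Assumption: each inner list plays the role of one partition.
partitions = [
    [(16, 5000), (10, 3000), (13, 2600), (19, 1800)],
    [(11, 4000), (17, 3100), (14, 2500), (20, 2000)],
]


def sort_by(parts, key):
    """Mimic SQL SORT BY: sort the rows inside each partition independently."""
    return [sorted(part, key=key) for part in parts]


def order_by(parts, key):
    """Mimic SQL ORDER BY: produce one total order across all rows."""
    all_rows = [row for part in parts for row in part]
    return sorted(all_rows, key=key)


def by_id(row):
    """Sort key: the first field of each (Id, Salary) row."""
    return row[0]


# Read the partitions back to back, the way a consumer of the result would.
sort_by_ids = [row[0] for part in sort_by(partitions, by_id) for row in part]
order_by_ids = [row[0] for row in order_by(partitions, by_id)]

print(sort_by_ids)   # [10, 13, 16, 19, 11, 14, 17, 20]
print(order_by_ids)  # [10, 11, 13, 14, 16, 17, 19, 20]
```

With a single partition the two results would coincide, which is one reason the difference is easy to miss on small data.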

edited Dec 10 '20 at 14:41

kamui

answered Dec 10 '20 at 14:13

RaHuL VeNuGoPaL


If that's the case, what's really the use of SORT BY? Also, I haven't actually noticed this distinction with multiple partitions yet. – ss301 Sep 08 '21 at 13:35

In the SQL API there is this difference between `SORT BY` and `ORDER BY`. The question was in Scala API, where the DataFrame methods `sort()` and `orderBy()` do actually the same thing. To do SQL `SORT BY`, Scala has `sortWithinPartitions()`. Similarly in the PySpark API. – Melkor.cz Oct 24 '22 at 11:20

score 0 · Answer 3 · edited Jun 12 '20 at 15:00

The sort() function sorts the output in each bucket by the given columns on the file system. It does not guarantee the order of the output data. The orderBy() operation, by contrast, happens in two phases.

First the data is sorted inside each bucket using sortBy(); then the entire dataset has to be brought into a single executor to establish the overall ascending or descending order on the specified column. This involves heavy shuffling and is a costly operation.

The sort() operation, however, happens inside each individual bucket and is a lightweight operation.

Here is an example:

Preparing data

>>> listOfTuples = [(16,5000),(10,3000),(13,2600),(19,1800),(11,4000),(17,3100),(14,2500),(20,2000)]
>>> tupleRDD = sc.parallelize(listOfTuples,2)
>>> tupleDF = tupleRDD.toDF(["Id","Salary"])

The data looks like this:

>>> tupleRDD.glom().collect()
[[(16, 5000), (10, 3000), (13, 2600), (19, 1800)], [(11, 4000), (17, 3100), (14, 2500), (20, 2000)]]
>>> tupleDF.show()
+---+------+
| Id|Salary|
+---+------+
| 16|  5000|
| 10|  3000|
| 13|  2600|
| 19|  1800|
| 11|  4000|
| 17|  3100|
| 14|  2500|
| 20|  2000|
+---+------+

Now the sort operation gives:

>>> tupleDF.sort("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10|  3000|
| 11|  4000|
| 13|  2600|
| 14|  2500|
| 16|  5000|
| 17|  3100|
| 19|  1800|
| 20|  2000|
+---+------+

See, the order is not as expected. Now let's look at the orderBy operation:

>>> tupleDF.orderBy("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10|  3000|
| 11|  4000|
| 13|  2600|
| 14|  2500|
| 16|  5000|
| 17|  3100|
| 19|  1800|
| 20|  2000|
+---+------+

It maintains the overall order of the data.

I did not understand what you mean by "the order is not as expected". Both outputs seem the same to me. – Niranjan Viladkar Sep 08 '20 at 11:55
