First, to set up context for readers who may not know the definition of a stable sort, I'll quote from this StackOverflow answer by Joey Adams:
"A sorting algorithm is said to be stable if two objects with equal
keys appear in the same order in sorted output as they appear in the
input array to be sorted" - Joey Adams
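As a quick aside, Python's built-in sorted() is documented to be stable, which makes the definition easy to see in a couple of lines:
pairs = [(1, 'one'), (2, 'three'), (1, 'two'), (2, 'four')]
# sorted() is stable: the two items with key 1 keep their input order
# ('one' before 'two'), as do the two items with key 2 ('three' before 'four').
print(sorted(pairs, key=lambda p: p[0]))
# [(1, 'one'), (1, 'two'), (2, 'three'), (2, 'four')]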
Now, a window function in Spark can be thought of as Spark processing mini-DataFrames of your entire set, where each mini-DataFrame is created on a specified key - "group_id" in this case.
That is, if the supplied DataFrame also contained rows with "group_id"=2, we would end up with two windows, one containing only the rows with "group_id"=1 and the other only the rows with "group_id"=2.
This is important to note, because it means we can test the effect of the .orderBy() call on a sample DataFrame without having to worry about what is happening to a Window. To emphasize what is happening (a small sketch follows the list below):
- Data is partitioned by a specified key
- Transformations are then applied to the 'mini-DataFrames' created in each window
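As a minimal sketch of that idea (assuming a DataFrame df like the one created just below; row_number() is only used here as an arbitrary window function):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Each window function is evaluated per partition ("mini-DataFrame"),
# here defined by group_id, with rows ordered by id inside the partition.
w = Window.partitionBy('group_id').orderBy('id')
df.withColumn('row_num', F.row_number().over(w)).show()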
Hence, for a pre-sorted input such as:
df = spark.createDataFrame(
    [
        {'group_id': 1, 'id': 1, 'text': 'one', 'type': 'a'},
        {'group_id': 1, 'id': 1, 'text': 'two', 'type': 't'},
        {'group_id': 1, 'id': 2, 'text': 'three', 'type': 'a'},
        {'group_id': 1, 'id': 2, 'text': 'four', 'type': 't'},
        {'group_id': 1, 'id': 5, 'text': 'five', 'type': 'a'},
        {'group_id': 1, 'id': 6, 'text': 'six', 'type': 't'},
        {'group_id': 1, 'id': 7, 'text': 'seven', 'type': 'a'},
        {'group_id': 1, 'id': 9, 'text': 'eight', 'type': 't'},
        {'group_id': 1, 'id': 9, 'text': 'nine', 'type': 'a'},
        {'group_id': 1, 'id': 10, 'text': 'ten', 'type': 't'},
        {'group_id': 1, 'id': 11, 'text': 'eleven', 'type': 'a'}
    ]
)
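Displaying it with df.show() gives: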
+--------+---+------+----+
|group_id| id| text|type|
+--------+---+------+----+
| 1| 1| one| a|
| 1| 1| two| t|
| 1| 2| three| a|
| 1| 2| four| t|
| 1| 5| five| a|
| 1| 6| six| t|
| 1| 7| seven| a|
| 1| 9| eight| t|
| 1| 9| nine| a|
| 1| 10| ten| t|
| 1| 11|eleven| a|
+--------+---+------+----+
We apply:
df.orderBy('id').show()
Resulting in:
+--------+---+------+----+
|group_id| id| text|type|
+--------+---+------+----+
| 1| 1| one| a|
| 1| 1| two| t|
| 1| 2| three| a|
| 1| 2| four| t|
| 1| 5| five| a|
| 1| 6| six| t|
| 1| 7| seven| a|
| 1| 9| nine| a|
| 1| 9| eight| t|
| 1| 10| ten| t|
| 1| 11|eleven| a|
+--------+---+------+----+
At first glance this looks mostly stable, although the two rows with id=9 have already changed places (text="nine" now comes before text="eight"). To make the behavior clearer, let's apply the same sort to a DataFrame in which the row with text="two" is swapped with the row with text="three":
df = spark.createDataFrame(
    [
        {'group_id': 1, 'id': 1, 'text': 'one', 'type': 'a'},
        {'group_id': 1, 'id': 2, 'text': 'three', 'type': 'a'},
        {'group_id': 1, 'id': 1, 'text': 'two', 'type': 't'},
        {'group_id': 1, 'id': 2, 'text': 'four', 'type': 't'},
        {'group_id': 1, 'id': 5, 'text': 'five', 'type': 'a'},
        {'group_id': 1, 'id': 6, 'text': 'six', 'type': 't'},
        {'group_id': 1, 'id': 7, 'text': 'seven', 'type': 'a'},
        {'group_id': 1, 'id': 9, 'text': 'eight', 'type': 't'},
        {'group_id': 1, 'id': 9, 'text': 'nine', 'type': 'a'},
        {'group_id': 1, 'id': 10, 'text': 'ten', 'type': 't'},
        {'group_id': 1, 'id': 11, 'text': 'eleven', 'type': 'a'}
    ]
)
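Again, df.show() gives: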
+--------+---+------+----+
|group_id| id| text|type|
+--------+---+------+----+
| 1| 1| one| a|
| 1| 2| three| a|
| 1| 1| two| t|
| 1| 2| four| t|
| 1| 5| five| a|
| 1| 6| six| t|
| 1| 7| seven| a|
| 1| 9| eight| t|
| 1| 9| nine| a|
| 1| 10| ten| t|
| 1| 11|eleven| a|
+--------+---+------+----+
Then apply:
df.orderBy(df.id).show()
Which results in:
+--------+---+------+----+
|group_id| id| text|type|
+--------+---+------+----+
| 1| 1| two| t|
| 1| 1| one| a|
| 1| 2| four| t|
| 1| 2| three| a|
| 1| 5| five| a|
| 1| 6| six| t|
| 1| 7| seven| a|
| 1| 9| nine| a|
| 1| 9| eight| t|
| 1| 10| ten| t|
| 1| 11|eleven| a|
+--------+---+------+----+
As you can see, even though the rows with text="one" and text="two" appear in the same relative order in the input, .orderBy() swaps them around in the output. Thus, we can conclude that .orderBy() is not a stable sort.
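If you'd rather check this programmatically than by eye, one option (a sketch, not the only approach) is to tag each row with its input position via monotonically_increasing_id() before sorting, then inspect rows that share a key:
from pyspark.sql import functions as F

# Record the original input position before sorting; the generated ids are
# only guaranteed to be increasing, which is enough to compare the relative
# input order of any two rows.
tagged = df.withColumn('input_pos', F.monotonically_increasing_id())

# A stable sort on 'id' would leave 'input_pos' non-decreasing within each
# group of equal 'id' values; as the output above shows, that is not
# guaranteed here.
tagged.orderBy('id').select('id', 'text', 'input_pos').show()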