1

When I read the explanation in the accepted answer to this question, I cannot see how order by NEWID() offers randomness for the whole table. Here is the query plan given in the accepted answer:

enter image description here

The query plan, which is also what's given to me by my local copy of SSMS, assigns a random guid value to every row, but this assignment would produce a result set that follows the order of rows that I'd get if do a SELECT * FROM whatever_table.

Yes, sql does not guarantee any order for result sets unless order by is used, but almost all implementations I've seen return a result set that strongly overlaps with the insertion order, i.e. the physical layout of data on the disk, for obvious IO efficiency reasons.

Since Sql Server uses a TOP N Sort, it means it'll just randomise the order of the top N rows of the result set that I'd get with SELECT * FROM whatever_table. That's in a way a randomised ordering of more or less the same result set every time the query is run, not a randomisation of the rows from the whole table.

This answer to a different question however, filters out randomly generated values across the whole table's rows and randomises the selection.

Am I wrong?

Update: Of course I'm wrong. My mistake lies in my incorrect assumption of how TOP N Sort operates. I thought it takes first N rows of a result set, then orders that subset. Instead, it finds the Max (presumably Min if order by newid() asc is used) and stops finding the next value when N values are found. Thanks for the comments that allowed me to see my error in my thinking.

mahonya
  • 9,247
  • 7
  • 39
  • 68
  • 3
    I don't see how "this assignment would produce a result set that follows the order of rows" follows from the preceding sentence(s). – Damien_The_Unbeliever Dec 13 '21 at 09:56
  • 1
    A very simple test is all you need to show your presumption is not correct; the `order by newid()` is applied *before* `top` – Stu Dec 13 '21 at 10:02
  • 2
    `NEWID` values aren't produced sequentially, so if *every* row is assigned a value, and *then* you order them then the sort you get would likely be very different to the order the data was inserted, and an arbitrary order when an `ORDER BY` is omitted. – Thom A Dec 13 '21 at 10:04
  • The explanation for `order by newid()` is that it generates a random value for all the rows. I'm assuming this is the same as a calculated column added to a `SELECT` clause, which produces the same order of rows with a `SELECT * FROM...` – mahonya Dec 13 '21 at 10:07
  • Perhaps you need to actually put the queries you are confused by in your question, @mahonya . – Thom A Dec 13 '21 at 10:10
  • 2
    If you have a computed column in the select clause, it produces different values for different rows *and you use that in the `ORDER BY`*, then it very definitely affects the order of the rows output. – Damien_The_Unbeliever Dec 13 '21 at 10:10
  • Probably you should just try adding `newid()` to the `SELECT` list of the query and look at the results. Maybe you think it returns the same value for all rows as `rand()` does? – Martin Smith Dec 13 '21 at 10:16
  • Damien_The_Unbeliever I can see that my misunderstanding is due to my incorrect assumption of how `TOP N SORT` operates. I thought it'd first identify N rows, then do a sort on them, which was really dumb of me to be honest. If you'd be kind enough to write your last brief comment as an answer, I'll accept it and update my Q with where my error lies. – mahonya Dec 13 '21 at 10:16
  • 1
    `TOP N SORT` with `N=10` just needs to reserve 10 slots for the 10 results to be returned. Then for each row see if it has a value less than the max value in the reserved slots. And if so allocate it a slot and potentially discard a row that is pushed out. – Martin Smith Dec 13 '21 at 10:26
  • 1
    Also, please forget about this right quick: "but almost all implementations I've seen return a result set that strongly overlaps with the insertion order, i.e. the physical layout of data on the disk, for obvious IO efficiency reasons". For starters, this instantly flies out the window if parallelism happens (which could be sudden and unexpected one day as the table becomes "large enough") and it doesn't work if page splits happen (as the concept of "insertion order" then breaks down). The "strong overlap" is far more fragile than you might think. Instead, just think "no `ORDER BY`, no order". – Jeroen Mostert Dec 13 '21 at 10:52
  • Jeroen Mostert : Thank you! Very good points and I'll keep them both in mind. – mahonya Dec 13 '21 at 10:53
  • Does this answer your question? ["order by newid()" - how does it work?](https://stackoverflow.com/questions/4979799/order-by-newid-how-does-it-work) – Charlieface Dec 13 '21 at 18:17
  • Although technically speaking, SQL Server is not *required* to calculate `NEWID()` multiple times if it is referred to only once, but in practice it always does – Charlieface Dec 13 '21 at 18:18

1 Answers1

2

ORDER BY serves two different purposes. It can be used to sort the rows which a query returns to a client. And it can be used to help to define the order to be considered in defining TOP or OFFSET ... FETCH (also window functions).

In some circumstances, as here, it's being used for both. The TOP clause finds the top rows, first taking into account the requested sort order defined by the ORDER BY clause. Since this ORDER BY is also an ORDER BY on the outermost query, it's also then used to define the order in which the rows are returned to the client1.

In other circumstances, such as when TOP or OFFSET ... FETCH are used in a subquery, the corresponding ORDER BY is only used to determine which rows are to be included in that subquery, and it has no formal influence on the order in which the final complete rows are returned.


1And many implementations will try to take advantage of this combined nature by picking an appropriate index or at least only sorting rows once to fulfil both purposes.

Damien_The_Unbeliever
  • 234,701
  • 27
  • 340
  • 448