7

I've used SQL in couple databases engines from time to time several years but have little theoretical knowledge so my question could be very "noobish" for some of you. But it become important to me now so I have to ask.

Imagine table Urls with non unique column status. And for the question assume that we have large amount of rows and status has the same value in every record.

And imagine we execute many times query:

SELECT * FROM Urls ORDER BY status
  1. Do we get every time the same row order or not? If we do what will happen if we add some new rows? Does it change order or new records will be appended to end of the results? And if we don't get the same order - on what conditions depend this order?

  2. Do ROW_NUMBER() OVER (ORDER BY status) will return the same order as query above or it is based on different ordering mechanism?

MKB
  • 281
  • 2
  • 14
  • 4
    1. No. 2. Mechanism is same, but result can be different. Actual return order of unordered rows depends on query optimizer decisions and data/index physical layout. – Arvo Sep 04 '13 at 11:55

4 Answers4

11

It's very simple. If you want an ordering that you can rely upon, then you need to include enough columns in your ORDER BY clause such that the combination of all of those columns is unique for each row. Nothing else is guaranteed.

For a single table, you can usually get what you want by listing the columns that are "interesting" to sort by and then including the primary key column(s) afterwards. Since the PK, by itself, guarantees uniqueness, the whole combination is also guaranteed to uniquely define the ordering, e.g. If the Urls table has a primary key of {Site, Page, Ordinal} then the following would give you a dependable result:

SELECT * FROM Urls ORDER BY status, Site, Page, Ordinal
Damien_The_Unbeliever
  • 234,701
  • 27
  • 340
  • 448
  • +1 Clearly if you have JOINs you have to combine the PK of all the tables, and if you have UNIONs you can add a new field with values 1, 2, 3 for each UNION. – xanatos Sep 04 '13 at 12:18
8

ORDER BY is not stable in SQL Server (nor in any other database, as far as I know). A stable sort is one that returns records in the same order that they are found in the table.

The high-level reason is quite simple. Tables are sets. They have no order. So a "stable" sort just doesn't make sense.

The lower-level reasons are probably more important. The database could be implementing a parallel sort algorithm. Such algorithms are not, by default, stable.

If you want a stable sort, then include a key column in the sorting.

This is alluded to in the documentation:

To achieve stable results between query requests using OFFSET and FETCH, the following conditions must be met:

The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all requests for pages from the query are executed in a single transaction using either snapshot or serializable transaction isolation. For more information about these transaction isolation levels, see SET TRANSACTION ISOLATION LEVEL (Transact-SQL).

The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I have been using SQL for almost 20 years and had to look up the definition of a STABLE sort. It is kind-of confusing. When talking about a database, you first think ACID. Therefore, everything has to be consistent (stable). In this context, you are saying the data on the pages is in the same order as results (ie - stable sort). This does not mean a ORDER BY with a correct ISOLATION LEVEL will not return the correct results. – CRAFTY DBA Sep 04 '13 at 13:12
  • I'm confused; the question asks about "deterministic", this question answers about "stable"; aren't those [two different things?](http://stackoverflow.com/questions/2313940/what-is-a-deterministic-quicksort#comment2282232_2314031) – Nate Anderson Aug 04 '16 at 05:02
0

I really love these types of questions since you can get into doing performance analysis.

First, lets create a sample [test] database with a [urls] table with a million random records.

See code below.

-- Switch databases
USE [master];
go

-- Create simple database
CREATE DATABASE [test];
go

-- Switch databases
USE [test];
go

-- Create simple table
CREATE TABLE [urls]
    (
      my_id INT IDENTITY(1, 1)
                PRIMARY KEY ,
      my_link VARCHAR(255) ,
      my_status VARCHAR(15)
    );
go

-- http://stackoverflow.com/questions/1393951/what-is-the-best-way-to-create-and-populate-a-numbers-table

-- Load table with 1M rows of data 
;
WITH    PASS0
          AS ( SELECT   1 AS C
               UNION ALL
               SELECT   1
             ),           --2 rows
        PASS1
          AS ( SELECT   1 AS C
               FROM     PASS0 AS A ,
                        PASS0 AS B
             ),  --4 rows
        PASS2
          AS ( SELECT   1 AS C
               FROM     PASS1 AS A ,
                        PASS1 AS B
             ),  --16 rows
        PASS3
          AS ( SELECT   1 AS C
               FROM     PASS2 AS A ,
                        PASS2 AS B
             ),  --256 rows
        PASS4
          AS ( SELECT   1 AS C
               FROM     PASS3 AS A ,
                        PASS3 AS B
             ),  --65536 rows
        PASS5
          AS ( SELECT   1 AS C
               FROM     PASS4 AS A ,
                        PASS4 AS B
             ),  --4,294,967,296 rows
        TALLY
          AS ( SELECT   ROW_NUMBER() OVER ( ORDER BY C ) AS Number
               FROM     PASS5
             )
    INSERT  INTO urls
            ( my_link ,
              my_status
            )
            SELECT 
      -- top 10 search engines + me
                    CASE ( Number % 11 )
                      WHEN 0 THEN 'www.ask.com'
                      WHEN 1 THEN 'www.bing.com'
                      WHEN 2 THEN 'www.duckduckgo.com'
                      WHEN 3 THEN 'www.dogpile.com'
                      WHEN 4 THEN 'www.webopedia.com'
                      WHEN 5 THEN 'www.clusty.com'
                      WHEN 6 THEN 'www.archive.org'
                      WHEN 7 THEN 'www.mahalo.com'
                      WHEN 8 THEN 'www.google.com'
                      WHEN 9 THEN 'www.yahoo.com'
                      ELSE 'www.craftydba.com'
                    END AS my_link ,

      -- ratings scale
                    CASE ( Number % 5 )
                      WHEN 0 THEN 'poor'
                      WHEN 1 THEN 'fair'
                      WHEN 2 THEN 'good'
                      WHEN 3 THEN 'very good'
                      ELSE 'excellent'
                    END AS my_status
            FROM    TALLY AS T
            WHERE   Number <= 1000000
go

Second, we always want to clear the buffers and cache when doing performance analysis in our test environment. Also, we want to turn on statistics I/O and time to compare the results.

See code below.

-- Show time & i/o
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO

-- Remove clean buffers & clear plan cache
CHECKPOINT 
DBCC DROPCLEANBUFFERS 
DBCC FREEPROCCACHE
GO

Third, we want to try the first TSQL statement. Look at the execution plan and capture the statistics.

-- Try 1
SELECT * FROM urls ORDER BY my_status

/*
Table 'urls'. Scan count 5, logical reads 4987, physical reads 1, read-ahead reads 4918, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 3166 ms,  elapsed time = 8130 ms.
*/

enter image description here

Fourth, we want to try the second TSQL statement. Do not forget to clear the query plan cache and buffers. If you do not, the query takes less than 1 sec since most of the information is in memory. Look at the execution plan and capture the statistics.

-- Try 2
SELECT ROW_NUMBER() OVER (ORDER BY my_status) as my_rownum, * FROM urls

/*
Table 'urls'. Scan count 5, logical reads 4987, physical reads 1, read-ahead reads 4918, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 3276 ms,  elapsed time = 8414 ms.
*/

enter image description here

Last but not least, here is the fun part, the performance analysis.

1 - We can see that the second plan is a super set of the first. So both plans scan the clustered index and sort the data. Parallelism is used to put the results together.

2 - The second plan / query needs to calculate the row number. It segments the data and calculates this scalar. Therefore, we end up with two more operators in the plan.

It is not surprising that the first plan runs in 8130 ms and the second plan runs in 8414 ms.

Always look at the query plan. Both estimated and actual. They tell you want the engine is planning to do and what it actually does.

In this example, two different TSQL statements come up with almost identical plans.

Sincerely

John

www.craftydba.com

Ardalan Shahgholi
  • 11,967
  • 21
  • 108
  • 144
CRAFTY DBA
  • 14,351
  • 4
  • 26
  • 30
  • 1
    I commend the effort, but it doesn't really answer the question about whether ORDER BY (whether on a result set on in a ROW_NUMBER function) is deterministic and if not what factors influence it, rather than a performance comparison of ROW_NUMBER and ORDER BY. However I'd also add that while I'm not surprised ORDER BY Outperforms ROW_NUMBER, a single test with a 3.37% variance doesn't strike me as conclusive proof! – GarethD Sep 04 '13 at 14:16
  • Re-reading the question, I guess I missed the mark. The user really wanted to know two questions; 1 - added records show up in the results and 2- is ORDER BY different than ROW_NUMBER when it comes to the transaction engine. The first question depends upon ISOLATION level. The second question is answered by my query analysis. They obvious use a similar plan. – CRAFTY DBA Sep 04 '13 at 15:19
  • Well I agree with you @user2577687 that testing by yourself is a good practise many times. But there are also flaws of it. Sometimes asking is much more optimal. On example you can write question and wait for an answer in couple minutes and test could get much longer. And time is sometimes very vital thing. Especially if you do something not for yourself but for employer :-) I'm running procedure which update in smaller packs over 200 mln records. It runs slower, and slower. Meanwhile I'm trying to find a better one. Testing will slow me down even more because it demands to stop current query. – MKB Sep 04 '13 at 15:24
0

The general answer to any sql question "what order does this output in" is "whatever the server feels like, and it may not be the same from query to query" unless you have specifically requested a order.

Even something simple like 'select top 1000 myColumn from myTable' can come back with any rows in any order; eg the server can use parallel threads and the first thread to start returning results started reading in the middle of the table, or an index was used which included myColumn, so you got the rows with the alphabetically first productName (this time; last time the index had different stats so it selected a different index and gave you the 1000 oldest transactions)...

It is even theoretically possible for the server to say "i had these 10 pages in my memory cache that match your query, i'll pass you these ones while i wait for the disk to return the rest...

Andrew Hill
  • 1,921
  • 1
  • 25
  • 31