3

I just stumbled upon a few lines of code in a system I just started working with that I don't really get. The system has a large table that saves lots of entities with unique IDs and removes them once they're not longer needed but it never reuses them. So the table looks like this

------------------------
| id |info1|info2|info3|
------------------------
| 1  | foo1| foo2| foo3|
------------------------
| 17 | bar1| bar2| bar3|
------------------------
| 26 | bam1| bam2| bam3|
------------------------
| 328| baz1| baz2| baz3|
------------------------
etc.

In one place in the codebase there is a while loop whose purpose it is to loop through all entities in the DB and do things to them and right now this is solved like this

int lastId = fetchMaxId()
int id = 0
while (id = fetchNextId()){
  doStuffWith(id)
}

where fetchMaxId is straight forward

int fetchMaxId(){
  return sqlQuery("SELECT MAX(id) FROM Table")
}

but fetchNextId confuses me. It is implemented as

int fetchNextId(currentId, maxId){
  return sqlQuery("
    SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
  ")
}

This system has been in production for several years so it obviously works but when I tried searching for a solution to why this works I only found people saying the same thing that I already thought i knew. The order in which a MySQL DB returns the result is not easily determined and should not be relied upon so if you wan't a particular order use a ORDER BY clause. But are there some times you can safely ignore the ORDER BY? This code has worked for 12 years and continued to work through several DB updates. Are we just lucky or am I missing something here? Before I saw this code I would have said that if you called

fetchNextId(1, 328) 

you could end up with either 17 or 26 as the answer.

Some clues to why this works may be that the id column is the primary key of the Table in question and it's set to auto increment but I can't find any documentation that would explain why

fetchNextId(1, 328)

should always returns 17 when called on the table-snippet given above.

Viswanath Polaki
  • 1,357
  • 1
  • 10
  • 19
Spade
  • 310
  • 3
  • 21
  • 1
    The proper answer is probably quite involved - and a touch too technical for my pay grade - however, and in a nutshell, because the column is indexed there IS a 'natural' order imposed upon it. That said, this should not be relied upon. 'Just because it always worked that way' is not a good enough reason. – Strawberry Feb 17 '14 at 13:14
  • correct me if I am wrong, but I believe the default order of the select query is the insert order of elements. Since it's using autoincrement, deleting entries in the middle of the table don't make any difference. Now, if you delete some entry in the middle and add another with an id not generated by the autoincrement (reusing another from some previously deleted entry), then you'll have a problem. – Leo Feb 17 '14 at 13:19
  • see http://stackoverflow.com/questions/1949641/mysql-row-order-for-select-from-table-name – Leo Feb 17 '14 at 13:20
  • @Leo No. It's only the INSERT order in the absence of an INDEX (e.g. the PK) - but that absolutely should not be relied upon. – Strawberry Feb 17 '14 at 13:22
  • In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. Since you are performing a query that only selects and filters on the clustering key, then it is very likely that the index seek done will be ascending, so the result you get will be in the order required. Very likely, is not the same as guaranteed though, I have tried playing around with some data sets to see if I can force a reverse a reverse index scan, but as yet have been unsuccessful. – GarethD Feb 17 '14 at 13:38
  • In short, just add an order by, or use `MIN(ID)` and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan. – GarethD Feb 17 '14 at 13:39
  • @GarethD Thank you. That makes sense. If you rewrite it as an answer I'll accept it. – Spade Feb 19 '14 at 08:01

2 Answers2

3

The answer to your question is yes. If you look at MySQL documentation you will see that whenever a table has a primary key it has an associated index.

When looking at the documentation for indexes you will see that they will mention primary keys as a type of index.

So in case of your particular scenario:

SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1

The query will stop executing as soon as it has found a value because of the LIMIT 1. Without the LIMIT 1 it would have returned 17, 24 and 328.

However will all that said I don't think you will run into any order problems when the primary key is auto incrementing but whenever there is a scenario were the primary key is a unique employee no. instead of an auto incrementing field I would not trust the order of the result because the documentation also notes that MySQL reads sequentially, so the possibility is there that a primary key could fall out of the WHERE clause conditions and be skipped.

J2D8T
  • 827
  • 11
  • 24
3

The short answer is yes, the primary key has an order, all indexes have an order, and a primary key is simply a unique index.

As you have rightly said, you should not rely on data being returned in the order the data is stored in, the optimiser is free to return it in any order it likes, and this will be dependent on the query plan. I will however attempt to explain why your query has worked for 12 years.

Your clustered index is just your table data, and your clustering key defines the order that it is stored in. The data is stored on the leaf, and the clustering key helps the root (and intermediate notes) act as pointers to quickly get to the right leaf to retrieve the data. A nonclustered index is a very similar structure, but the lowest level simply contains a pointer to the correct position on the leaf of the clustered index.

In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. In other DBMS you can define both a primary key and a clustered index, when you do this your primary key becomes a unique nonclustered index with a pointer back to the clustered index.

In it's simplest terms you can imagine a table with an ID column that is the primary key, and another column (A), your B-Tree structure for your clustered index would be something like:

Root Node
                                +---+
                                | 1 |
                                +---+
Intermediate Nodes

                    +---+       +---+       +---+
                    | 1 |       | 4 |       | 7 |
                    +---+       +---+       +---+

Leaf
            +-----------+   +-----------+   +-----------+
    ID ->   | 1 | 2 | 3 |   | 4 | 5 | 6 |   | 7 | 8 | 9 |
    A ->    | A | B | C |   | D | E | F |   | G | H | I |
            +-----------+   +-----------+   +-----------+

In reality the leaf pages will be much bigger, but this is just a demo. Each page also has a pointer to the next page and the previous page for ease of traversing the tree. So when you do a query like:

SELECT ID, A
FROM T
WHERE ID > 5
LIMIT 1;

you are scanning a unique index so it is very likely this will be a sequential scan. Very likely is not guaranteed though.

MySQL will scan the Root node, if there is a potential match it will move on to the intermediate nodes, if the clause had been something like WHERE ID < 0 then MySQL would know that there were no results without going any further than the root node.

Once it moves on to the intermediate node it can identify that it needs to start on the second page (between 4 and 7) to start searching for an ID > 5. So it will sequentially scan the leaf starting on the second leaf page, having already identified the LIMIT 1 it will stop once it finds a match (in this case 6) and return this data from the leaf. In such a simple example this behaviour appears to be reliable and logical. I have tried to force exceptions by choosing an ID value I know is at the end of a leaf page to see if the leaf will be scanned in the reverse order, but as yet have been unable to produce this behaviour, this does not however mean it won't happen, or that future releases of MySQL won't do this in the scenarios I have tested.

In short, just add an order by, or use MIN(ID) and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan.

GarethD
  • 68,045
  • 10
  • 83
  • 123