How does non-sequential (eg, UUID/GUID) data degrade index performance?

Question

I've read in a couple of online articles pertaining to the performance of using UUID's as primary keys in MySQL - and a common theme, whether they are for-or-against is the idea that non-sequential data hurts index performance.

https://blog.codinghorror.com/primary-keys-ids-versus-guids/

The generated GUIDs should be partially sequential for best performance

https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/

Create function to rearrange UUID fields and use it (after showing how rearranging UUID can drastically improve performance)

However, I simply cannot understand how non-sequential data impacts indexes such as B-TREES, HASHES, CLUSTERED indexes.

score 5 · Accepted Answer · answered Nov 22 '16 at 23:48

5

You can add my UUID blog to your list. (It applies to MySQL equally well.)

Note that the performance problem does not occur until the index (whether clustered, or not whether BTree, hash, or other) is too big to be cached in RAM. At that point the "next" UUID you reach for (or try to insert) is unlikely to be in RAM, thereby necessitating I/O, which impacts performance.

In contrast, inserting rows keyed on a datetime, and doing so somewhat chronologically, will mostly be inserting into the same block of a BTree. This means that the "next" row is unlikely to need I/O.

I/O is the biggest factor in performance.

My blog points out how Type 1 uuids can be turned into something akin to a timestamp, thereby achieving the "locality of reference" that leads to less I/O, hence more speed. MySQL 8.0 has builtin functions that do the same thing as my Stored Functions. Still, you need Type 1 and need to call the function, in order to decrease the I/O.

answered Nov 22 '16 at 23:48

Rick James

135,179
13
127
222

1

So the issue is not 'non-sequential' but 'non-chronological'. – user207421 Nov 22 '16 at 23:59
1

@EJP - Neither -- I prefer to say "Where the next key is 'random'". Here's a different case; it works well: I have 10 widgets and I am randomly logging new rows for each of them. The `INSERTs` jump between 10 "hot spots" and insert a few row(s) at each. 10 hot spots is OK. 1 hot spot (auto_inc) is OK, but a million spots is too "random". – Rick James Nov 23 '16 at 00:04
The real question is "will my next operation find the block it needs in cache?". The fewer possible blocks, the more likely it will be in cache. – Rick James Nov 23 '16 at 00:05
When a "cache" is full, and the block you need is not there, some block needs to be bumped out of the way. That is (usually) the "least recently used". For UUIDs, that might be the block you need very soon -- "random". – Rick James Nov 23 '16 at 00:07
@RickJames. Hey, Rick. OP here. Thanks for your response. Just to check I understand, are you basically saying there is a 0 difference between using a UUID and an auto-increment primary key (theoretically) - and that the actual difference arises not from any fundamental contrast but due to the fact that when an index is large, MySQL often needs to run more IO to enforce constraints (e.g. uniqueness) - since UUIDS have less of a chance of being "in the local index" - which is in memory at that time. Correct? – AlanSTACK Nov 23 '16 at 07:32
@RickJames also, if I may, why does rearrange UUIDs timestamp component improve performance so much? Does MySQL's index system somehow only consider the first couple of characters? What leads to this behaviour? – AlanSTACK Nov 23 '16 at 07:37
"In the local index" might be a way to describe a "Cache". What is on disk is the complete index. What is in RAM is temporary copies of blocks. Randomness of access leads to the blocks more often not being the desired ones. – Rick James Nov 23 '16 at 21:27

How does non-sequential (eg, UUID/GUID) data degrade index performance?

1 Answers1