
We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated by dozens, hundreds, or even thousands of remote computers and inserted at a rate of 100 to 40,000 inserts per second, and we'll never do any updates.

The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planning to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.

We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing them as varchar(36), and I now realize we'd be better off using binary(16), though how much better off I'm not sure.

The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?

I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!
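To make the size difference concrete, here is a quick sketch using Python's standard `uuid` module, just illustrating what a varchar(36) column versus a binary(16) column would hold:

```python
import uuid

u = uuid.uuid4()            # random (type 4) UUID
text_form = str(u)          # 36-character dashed string
binary_form = u.bytes       # raw 128 bits

print(len(text_form))       # 36 -> what varchar(36) stores
print(len(binary_form))     # 16 -> what binary(16) stores

# Note: in a standard type-1 UUID the *low-order* timestamp bytes
# come first, so even "timestamped" values are not index-friendly
# without rearranging the bytes.
t = uuid.uuid1()
print(t.version)            # 1
```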

Patrick Lightbody

11 Answers


At my job, we use UUIDs as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (this is SQL Server, by the way).

It's one of those things that is OK when you have fewer than 1000 records, but when you have millions, it's the worst thing you can do. Why? Because UUIDs are not sequential, so every time a new record is inserted, MSSQL needs to find the correct page to insert the record into, and then insert it. The really ugly consequence is that the pages all end up different sizes and fragmented, so now we have to do periodic defragmentation.

When you use an autoincrement, MSSQL always goes to the last page, and you end up with equally sized pages (in theory), so the performance of selecting those records is much better (also because the INSERTs don't block the table/page for as long).

However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.

I would recommend the following model:

  1. PK INT Identity
  2. Additional column automatically generated as UUID.

This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).

NOTE: The best solution is to use NEWSEQUENTIALID (as I was saying in the comments), but for a legacy app without much time to refactor (and, even worse, without control over all inserts), it is not possible. But indeed, as of 2017, I'd say the best solution here is NEWSEQUENTIALID or doing Guid.Comb with NHibernate.
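The NEWSEQUENTIALID/Guid.Comb idea can be sketched outside the database. This is a rough illustration, not NHibernate's exact algorithm: keep the high-order bytes monotonic (a timestamp) and fill the rest with randomness, so raw-byte ordering roughly follows insertion order.

```python
import os
import time
import uuid

def comb_like_guid() -> uuid.UUID:
    """COMB-style sketch: a 48-bit millisecond timestamp in the
    high-order bytes, 80 random bits in the rest. (This skips
    setting the RFC 4122 version bits for simplicity.)"""
    millis = int(time.time() * 1000) & ((1 << 48) - 1)
    head = millis.to_bytes(6, "big")
    tail = os.urandom(10)
    return uuid.UUID(bytes=head + tail)

a = comb_like_guid()
time.sleep(0.05)
b = comb_like_guid()
assert a.bytes < b.bytes  # the later value sorts after the earlier one
```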

Pang
Kat Lim Ruiz
  • I don't really know what those terms mean, but the fact is that the indexes need to be rebuilt every month. Whether what you mention eliminates that reindexing task, I don't know, but I can ask. – Kat Lim Ruiz Jul 06 '12 at 00:50
  • 3
    Something that I've been thinking is that this may not work that well for parent-child relationships. In this case, I think you have to add in the child table: parent-pk, parent-guid. Otherwise you could lose references between databases. I haven't thought of this too much, nor done any example, but this may be needed – Kat Lim Ruiz Jul 11 '13 at 15:17
  • 4
    @KatLimRuiz in sql server you can use the NEWSEQUENTIALID() http://technet.microsoft.com/en-us/library/ms189786.aspx to avoid the performance issue – giammin Aug 29 '13 at 16:09
  • Indeed, but NEWSEQUENTIALID only works as a DEFAULT. So you need to design your whole DAL around this, which is OK for new projects but not so easy for a big legacy codebase – Kat Lim Ruiz Aug 30 '13 at 03:58
  • @KatLimRuiz genius. That's a great compromise – jmgunn87 Oct 14 '14 at 16:21
  • The answer assumes that all UUIDs are random (they're not) and that autoincrement is a clear alternative (it's not). autoincrement is a security leak waiting to happen since you can guess next and previous IDs. They also don't allow for database-level uniqueness. UUIDs also have the advantage that you can generate them in the app instead of the database and still ensure safe uniqueness. See: UUIDv1 for "sequential" ids. – Miles Elam Dec 04 '19 at 23:22
  • Note: NEWSEQUENTIALID is SQL Server-specific. The question specifically asked about MySQL. – Miles Elam Dec 05 '19 at 00:04
  • @MilesElam you are right indeed, so long this passed me by lol. Yet I'd say MySql has similar functions on its own. – Kat Lim Ruiz Dec 06 '19 at 14:28
  • If you are using ints as PKs, how can you manage things like sharding? How can you retrieve a user by its ID if your DB is distributed across several shards? The UUID will make that possible, but an INT will not and will inevitably lead to conflicts if the same INT is persisted on two or more shards. What is the best way to deal with such situations? Could using a UNIQUE KEY on the UUID field be a good solution while maintaining an INT PRIMARY KEY to boost JOINs within the single shard? Thank you! – tonix Oct 11 '20 at 16:36
  • well in my case, I did not do sharding per se, but it was thought to allow a similar model (in a more primitive way with separate databases). Indeed you are on the right track. However, I also wrote that the best solution is to use NEWSEQUENTIALID as it would serve both goals with a single field. – Kat Lim Ruiz Oct 12 '20 at 04:04

A UUID is a Universally Unique ID. It's the universally part that you should be considering here.

Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.

I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).

If your IDs do not need to be universally unique, then I don't think that you can do much better than just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time).

Dancrumb
  • Our reason for considering UUIDs is because in some situations we will have 1000+ machines dumping data in, and I didn't want us getting blocked on central ID generation - though perhaps I'm being pennywise, pound foolish :) – Patrick Lightbody Mar 02 '10 at 17:23
  • 2
    Interesting point; this would parallelize the generation of the keys. I believe that this would increase the performance of key generation. However, you're choosing INSERT performance over SELECT performance if you use VARCHAR for storing the UUID. You most definitely should choose VARBINARY for storing to ensure SELECT performance. The extra step *may* impact INSERT performance, but you'll be paid off with the SELECT performance improvement. – Dancrumb Mar 02 '10 at 17:40
  • 12
    We ended up doing some benchmarking on real data and GUIDs w/o keys was pretty fast, GUIDs w/ keys was horrible (even when stored as BINARY), and int w/ AUTO_COMPLETE was the fastest. I think in our case, we were indeed missing the forest from the trees, as the sequence generation seemed inconsequential compared to the cost of storing more data + having a really crappy BTREE due to the randomness of the GUIDs – Patrick Lightbody Mar 06 '10 at 14:15
  • Storing a UUID as a number? What do you mean by that, @Dancrumb? How could that be? A UUID is 36 characters long. – Chamnap Jun 05 '12 at 14:47
  • That'd be for you to decide. Storing a UUID as text requires 36 bytes; as a number, it requires 16 bytes. That's two BIGINTs. You could quite easily split the UUID over two columns. For 50M+ records, you'd be saving 763MB... this will help you make better use of your caches. – Dancrumb Jun 05 '12 at 15:46
  • 1
    store as a number means storing in binary format? but binary format is unreadable for human. It is slow because large bytes of uuid primary key? If it's, then I could store auto-increment with another column for uuid. Then, performance won't suffer. Am i right? – Chamnap Jun 07 '12 at 14:19
  • 4
    Strictly speaking, UUID is *universally* unique, meaning that it will never appear anywhere else in the world. You only need this if you're sharing your data publically. As for storing a UUID as a number, I don't mean in `binary` format. I mean as a 128 bit number, rather than a 288 bit string. For instance, the word 'hello' in ASCII is `68 65 6C 6C 6F`, which is the number 448,378,203,247. Storing the string '68656C6C6F' requires 10 bytes. The number 448,378,203,247 requires only 5. All in all, unless you *really* need the first U in UUID, you can't do much better than `auto_increment` – Dancrumb Jun 07 '12 at 14:37
  • Thanks for your comment. I don't know how to convert a UUID string (36 characters) into a 128-bit number. How is that done? – Chamnap Jun 08 '12 at 02:17
  • @Dancrumb, here is the question, http://stackoverflow.com/questions/10950202/how-to-store-uuid-as-number. I hope you answer this question. Thanks :) – Chamnap Jun 08 '12 at 13:57
  • @PatrickLightbody Thanks for your question and benchmarking regarding GUID as primary keys. I was surprised that even UUID v1 has poor index performance (vs UUID v4). I referred to your comment in a [similar question](http://stackoverflow.com/questions/13145988/will-uuid-as-primary-key-in-postgresql-give-bad-index-performance/13146662#comment21592003_13146662) – TaiwanGrapefruitTea Mar 08 '13 at 17:06
  • since version 8.0 the UUID performance problem is solved via UUID_TO_BIN ... – Vampirasu Oct 28 '19 at 07:39
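The string-to-128-bit conversion asked about in the comments is mechanical. A sketch in Python, including the two-BIGINT split mentioned above (the example UUID string is arbitrary):

```python
import uuid

s = "12345678-1234-5678-1234-567812345678"
u = uuid.UUID(s)

as_int = u.int                  # one 128-bit number
as_bytes = u.bytes              # 16 bytes, suitable for a BINARY(16) column
hi = as_int >> 64               # two 64-bit halves, suitable for
lo = as_int & ((1 << 64) - 1)   # two BIGINT columns

# Lossless round trip back to the 36-character text form:
assert str(uuid.UUID(int=(hi << 64) | lo)) == s
```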

Something to take into consideration is that autoincrement values are generated one at a time and cannot be produced in parallel. The case for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.

On performance, briefly:

A UUID in its text form is 36 characters long, including dashes. If you store it as VARCHAR(36), you're going to decrease compare performance dramatically. This is your primary key; you don't want it to be slow.

At the bit level, a UUID is 128 bits, which means it will fit into 16 bytes. Note that this is not very human-readable, but it keeps storage low, and it is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I would use a VARBINARY(16). Theoretically, this can work without a lot of overhead.

I recommend reading the following two posts:

I reckon between the two, they answer your question.

Kyle Rosendo
  • 2
    Actually, I read both those articles prior to posting this question, and I still didn't have a good answer here. For example, neither talk about type 1 vs type 4 UUIDS :( – Patrick Lightbody Mar 02 '10 at 17:27
  • Fair enough; I updated my answer a touch. I don't think it provides too much extra insight, however. – Kyle Rosendo Mar 02 '10 at 17:30
  • @Patrick: you put too many different topics into your question. –  Mar 02 '10 at 18:02
  • 1
    9 years later, but it should also be noted for posterity that unlike integer IDs, apps can generate UUIDs safely, removing the generation from the database entirely. Manipulation of the UUIDs for performance optimization (timestamp-based but modified so that they can be naively sorted) is notably easier in just about any language other than SQL. Luckily almost all databases today (MySQL included) handle UUID primary keys much better than they used to. – Miles Elam Dec 05 '19 at 00:42

I tend to avoid UUIDs simply because they are a pain to store and a pain to use as a primary key, but there are advantages. The main one is that they are UNIQUE.

I usually solve the problem and avoid UUID by using dual key fields.

COLLECTOR = UNIQUE ASSIGNED TO A MACHINE

ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)

This offers me two things: the speed of auto-inc fields, and the uniqueness of the data once it is collected and grouped together in a central location. While browsing the data, I also know where it was collected, which is often quite important for my needs.

I have seen many cases, while dealing with other clients' data sets, where they decided to use UUIDs but then still kept a field for where the data was collected, which really is a waste of effort. Simply using two (or more, if needed) fields as your key really helps.

I have just seen too many performance hits using UUID. They feel like a cheat...
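As a sketch of the dual-key idea (the class and method names are illustrative, not from the answer): each collector machine keeps its own auto-increment counter, and the pair (collector_id, record_id) is globally unique without any coordination.

```python
from itertools import count

class Collector:
    """Each machine gets a unique collector_id; record ids are a
    local auto-increment, so the pair (collector_id, record_id)
    is unique across all machines without coordination."""
    def __init__(self, collector_id: int):
        self.collector_id = collector_id
        self._seq = count(1)

    def next_key(self):
        return (self.collector_id, next(self._seq))

c1, c2 = Collector(1), Collector(2)
keys = [c1.next_key(), c1.next_key(), c2.next_key()]
print(keys)  # [(1, 1), (1, 2), (2, 1)]
assert len(set(keys)) == len(keys)  # no collisions
```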

  • 1
    This is actually a pretty neat, and sort of sounds obvious once its mentioned. The question is, how expensive is that dual key when used in big joins or whatever. – Shayne Jun 06 '22 at 13:40

Instead of centrally generating a unique key for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. That solves the problem of the overhead of connecting for each insert.

Keyserver maintains the next available id:

  • Server 1 requests an id block.
  • Keyserver returns (1, 1000).
    Server 1 can insert 1000 records before it needs to request a new block.
  • Server 2 requests an id block.
  • Keyserver returns (1001, 2000)
  • etc...

You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.
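The basic scheme can be sketched as follows (hypothetical names; a real keyserver would also need persistence and a network interface):

```python
import threading

class KeyServer:
    """Hands out contiguous id blocks: one locked increment per
    block instead of one round-trip per insert."""
    def __init__(self, block_size: int = 1000):
        self._next = 1
        self._block_size = block_size
        self._lock = threading.Lock()  # safe under concurrent requests

    def request_block(self) -> range:
        with self._lock:
            start = self._next
            self._next += self._block_size
        return range(start, start + self._block_size)

server = KeyServer()
block1 = server.request_block()   # ids 1..1000 for server 1
block2 = server.request_block()   # ids 1001..2000 for server 2
print(block1.start, block1[-1])   # 1 1000
print(block2.start, block2[-1])   # 1001 2000
```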

Bouke Versteegh
  • Interesting suggestion in theory. This would be complex to manage in practice. A more practical solution would probably be the answer posed by schworak. – Simon East Oct 28 '17 at 21:24

I realize this question is rather old, but I hit upon it in my research. Since then a number of things have happened (SSDs are ubiquitous, InnoDB got updates, etc.).

In my research I found this rather interesting post on performance:

It claims that due to the randomness of a GUID/UUID, index trees can get rather unbalanced. In the MariaDB KB I found another post that suggested a solution. But since then, the new UUID_TO_BIN function takes care of this. The function is only available in MySQL (tested in version 8.0.18), not in MariaDB (version 10.4.10).

TL;DR: Store UUID as converted/optimized BINARY(16) values.
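What UUID_TO_BIN(u, 1) does can be mimicked in Python to see what the server actually stores: per the MySQL documentation, the time-low and time-high groups of a version-1 UUID are swapped so the slow-moving bytes come first. A sketch of both directions:

```python
import uuid

def uuid_to_bin_swapped(u: uuid.UUID) -> bytes:
    """Mimic MySQL's UUID_TO_BIN(u, 1): put time_hi and time_mid
    in front of time_low so type-1 values become index-friendly."""
    b = u.bytes  # layout: time_low(4) time_mid(2) time_hi(2) rest(8)
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]

def bin_to_uuid_swapped(raw: bytes) -> uuid.UUID:
    """Inverse transform, like BIN_TO_UUID(raw, 1)."""
    return uuid.UUID(bytes=raw[4:8] + raw[2:4] + raw[0:2] + raw[8:16])

# The example UUID from the MySQL reference manual:
u = uuid.UUID("6ccd780c-baba-1026-9564-5b8c656024db")
raw = uuid_to_bin_swapped(u)
print(raw.hex())  # 1026baba6ccd780c95645b8c656024db
assert bin_to_uuid_swapped(raw) == u  # lossless round trip
```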

theking2

I would assign each server a numeric ID in a transactional manner. Then each record inserted just autoincrements its own counter. The combination of ServerID and RecordID will be unique. The ServerID field can be indexed, and future SELECT performance based on ServerID (if needed) may be much better.

Nikolai

The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:

  • choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
  • munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
  • make the UUID a secondary key with an auto-increment int primary key

... but these are all hacks--and probably fragile ones at that.

The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.

StephenS

What about a hand-crafted UID? Give each of the thousands of servers an ID and make the primary key a combo key of (autoincrement, MachineID)?

MindStalker
  • I've thought about that and might need to run some benchmarks. Even a temporary local sequence on each of the 1000 machines, combined with a timestamp, might be enough. Ex: machine_id + temp_seq + timestamp – Patrick Lightbody Mar 02 '10 at 18:11
  • Is it possible to have a temp_sequence that resets every timestamp tick? I'm not sure. – MindStalker Mar 02 '10 at 18:45
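The machine_id + temp_seq + timestamp idea from these comments resembles what is now commonly called a Snowflake-style id. A hedged 64-bit sketch (the bit widths are illustrative, not from the discussion); the per-tick sequence answers the question above, since it simply restarts when the timestamp advances:

```python
import time

def make_id(machine_id, seq, millis=None):
    """Pack a millisecond timestamp, a 10-bit machine id and a
    12-bit per-tick sequence into one sortable 64-bit integer."""
    if millis is None:
        millis = int(time.time() * 1000)
    assert machine_id < (1 << 10) and seq < (1 << 12)
    return (millis << 22) | (machine_id << 12) | seq

# Ids order by time first, then by sequence within a tick:
a = make_id(machine_id=7, seq=0, millis=1_000_000)
b = make_id(machine_id=7, seq=1, millis=1_000_000)
c = make_id(machine_id=3, seq=0, millis=1_000_001)
assert a < b < c
```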

Since the primary keys are generated decentrally, you don't have the option of using auto_increment anyway.

If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of Type 4. They are easier to generate and at least won't hurt the performance of the database.

The same goes for varchar (char, really) vs. binary: binary can only help matters. Does it really matter how much performance is improved?


The main case where UUIDs cause miserable performance is this:

When the index is too big to be cached in the buffer_pool, each lookup tends to be a disk hit. For HDDs, this can slow down access by 10x or worse. (No, that is not a typo for "10%".) With SSDs, the slowdown is less, but still significant.

This applies to any "hash" (MD5, SHA256, etc), with one exception: A type-1 UUID with its bits rearranged.

Background and manual optimization: UUIDs

MySQL 8.0: see UUID_TO_BIN() and BIN_TO_UUID()

MariaDB 10.7 carries this further with its UUID datatype.

Rick James