26

Whenever I design a database I automatically start with an auto-generating GUID primary key for each of my tables (excepting look-up tables)

I know I'll never lose sleep over duplicate keys, merging tables, etc. To me it just makes sense philosophically that any given record should be unique across all domains, and that that uniqueness should be represented in a consistent way from table to table.

I realize it will never be the most performant option, but putting performance aside, I'd like to know if there are philosophical arguments against this practice?

Based on the responses let me clarify:

I'm talking about consistently using a GUID surrogate key as a primary key- irrespective of whether and how any natural or sequential keys are designed on a table. These are my assumptions:

  1. Data integrity based on natural keys can be designed for, but not assumed.
  2. A primary key's function is referential integrity, irrespective of performance, sequencing, or data.
Yarin
  • 173,523
  • 149
  • 402
  • 512

8 Answers8

18

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table.

What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to. The main reason for this is indeed performance, which will come and bite you down the road... (it will, trust me - just a matter of time) - plus also a waste of resources (disk space and RAM in your SQL Server machine) which is really not necessary.

You really need to keep two issues apart:

1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

Marc

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • @marc_s - I've yet to see any such analysis on COMB guids (not newsequentialid) vs integers as the clustered key. Yes it eats up more space but frankly, space isn't really the primary constraint on most database systems. – Thomas Jun 16 '10 at 05:03
  • @THomas: it eats a lot more space, and **not** just on disk - but also in your server's main memory (RAM) - something a lot of folks don't take into account! RAM isn't quite as cheap as disk space, and more space = more I/O = less performance (generalized and simplified) – marc_s Jun 16 '10 at 05:11
  • @marc_s - Adding an int clustered key on all your tables (along with a GUID pk) would not solve the problem of merging databases. You would still deal with the horrid nightmare of syncing identity columns. In addition, you would be eating up 20 bytes of index space instead of 16 with just a guid PK since you'll need a unique constraint on the guid column. I'm not convinced that an additional int clustered key that you have to never use is better than just picking a surrogate pk strategy (comb guids or int pks). – Thomas Jun 16 '10 at 05:14
  • 1
    @marc_s - Again, 25MB vs 106MB on a 4-8 GB system is peanuts and that's assuming your table has a million rows. – Thomas Jun 16 '10 at 05:15
  • @Thomas: yes, but that's just one single table... imagine you have 50, 100 of those - and not just 1 million, but 100, 500 million rows. It's all a matter of scale - if you only deal with 50'000 rows - go nuts and do whatever you like - (almost) no problem. But if you need to scale, you better know what the heck you're doing.... – marc_s Jun 16 '10 at 05:18
  • Also, 106MB vs. 25 MB also means you most like have 4 times as much I/O just on that one single table. Multiply that by 50, 100 tables in your system, and by 50, 200, 500, 5000 users on your system - it gets out of hand very quickly. – marc_s Jun 16 '10 at 05:20
  • @marc_s - I would definitely agree that the strategy you use for medium to large database (one expect to hit 100GB for medium or 1TB for large) is a different kettle of fish (even in your decision to use surrogates of any kind). In fact, I would also say that if the system is a single install design vs a product database (where you might get customer's merge databases) that too should effect your decision. I've seen guid pk systems grow well over 100 GB with no problems in performance. However, I'll grant you I've never seen anything in the terabyte range with guid pks. – Thomas Jun 16 '10 at 05:23
  • @marc_s - How do you save that 4x IO if you have both a int clustered key and a guid key and are always searching on the guid key? Using both an int and a guid key throughout the database seems like a crazy design. Am I misreading your suggestion? – Thomas Jun 16 '10 at 05:33
  • @James Westgate: of course, even index pages must be loaded into memory, and if you're wasting space on those pages, you're wasting space in your server memory.... – marc_s Jun 16 '10 at 06:45
  • 1
    @marc_s - Yes I think this is an excellent point, thanks- although I would still put it in the performance, not philosophical, camp – Yarin Jun 16 '10 at 15:01
  • @marc yasher koach to you on this fabulous answer – JOE SKEET Jan 14 '11 at 22:43
15

Jeff Atwood talks about this in great detail:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html

Guid Pros:
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway

Guid Cons:
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

Matt
  • 74,352
  • 26
  • 153
  • 180
Evan Nagle
  • 5,133
  • 1
  • 26
  • 24
  • 2
    If you don't do replication, then I've never seen a use of GUIDs to merge records because it's the contents that usually matter first and foremost. Additionally I rarely merge due to foreign keys, I'd love to be able to on a regular basis. I add them if my database is distributed or will be replicated, otherwise I rely on date timestamps. – databyte Jun 16 '10 at 04:54
  • anything not in your processors native int size will run massively slower for index operations. Also have to take into account that GUIDS generation typically has to lock something more global than a regular local sequence update. – Evan Carroll Jun 16 '10 at 16:09
  • Good first link, and I'd add http://krow.livejournal.com/497839.html Seems like the cons all still in the performance camp, and that's becoming less relevant... – Yarin Jun 16 '10 at 16:10
  • @Yarin, the performance camp is becoming less relevant? Are you kidding? Performance is critical for databases. It is the second most important thing after data integrity. – HLGEM Dec 15 '10 at 20:09
  • 1
    @HLGEM- what I mean is, the choice of GUID vs INT is increasingly less relevant to the performance of a database- due to improvements in hardware/software- not that performance in and of itself is less relevant. – Yarin Oct 13 '11 at 15:48
  • @Yarin I hope it is implied though that the detail we store and in the end demand in query is going to demand any hardware performance gain right back for the majority of us. (When I get better hardware, it often goes to address backlogged features or the performance degradation of deployed features). – Christopher McGowan May 31 '16 at 15:54
4

Adding to ewwwn:

Pros

  • It makes it nearly impossible for developers to "accidentally" expose the surrogate key to users (unlike integers where it happens almost all the time).
  • Makes merging databases several orders of magnitude simpler than dealing with identity columns.

Cons

  • Fatter. The real problem with it being fatter is that it eats up more space per page and more space in your indexes making them slower. The additional storage space of Guids is frankly irrelevant in today's world.
  • You absolutely must be careful about how new values are created. Truly random values do not index well. You are compelled to use a COMB guid or some variant that adds a sequential element to the guid.
Thomas
  • 63,911
  • 12
  • 95
  • 141
  • "truly uber-unique values" : INT IDENTITY are probably the utmost unique you can get - and they handle REALLY well as clustering keys. It's not the uniqueness of GUIDs that cause the issues - it's the randomness. – marc_s Jun 16 '10 at 05:01
  • 1
    @marc_s - Bad choice of words on my part. By uber-unique I mean across time and space. "Random" is a more apt terms and I'll adjust. – Thomas Jun 16 '10 at 05:04
  • It seems to me that it makes it several orders of magnitude easier to create duplicate entities when merging databases -- not good. I like the point about making it harder to expose surrogates, though :) – onedaywhen Jun 16 '10 at 08:33
  • @onedaywhen - RE: creating duplicates, if by that you mean creating duplicates of the data itself, that is a different matter. No matter what surrogate key strategy you use, you *must* have a business key on other columns in the table for reasons that have nothing to do with database merges. – Thomas Jun 16 '10 at 14:49
  • Another point then: if you shun surrogate keys entirely and always use natural keys then it makes it impossible to expose surrogate keys to users :) – onedaywhen Jun 18 '10 at 08:43
  • "you must have a business key on other columns in the table for reasons that have nothing to do with database merges" -- sure but in my experience "surrogate key as primary key" advocates more often than not don't bother with candidate keys, business keys or otherwise, in ignorance or otherwise. – onedaywhen Jun 18 '10 at 08:45
  • @onedaywhen - RE: business key. No question the most common mistake with surrogate keys, regardless of type, is not having a unique constraint on some other combination of columns. Avoiding surrogate keys entirely is often impractical but forcing the use on all tables is also often not necessary but I see it all the time. It is as if the developer feels that adding a surrogate key somehow obviates the need for any thoughts about uniqueness. – Thomas Jun 18 '10 at 14:27
4

You still implement the natural key of each table as well don't you? - GUID keys alone obviously won't prevent duplicate data, redundancy and consequent loss of data integrity.

Assuming that you do enforce other keys then adding GUIDs to every table without exception is probably just adding unnecessary complexity and overhead. It doesn't really make it easier to merge data in different tables because you still have to modify/de-duplicate the other key(s) of the table anyway. I suggest you should evaluate the use of a GUID surrogate on a case-by-case basis. Having a blanket rule for every table isn't necessary or helpful because every table models a different thing after all.

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • Agreed that as a DB designer you always need to assess and enforce data integrity on a case by case basis, and that one should be wary of blanket rules. But in my view there's value in the GUID surrogate providing a common interface approach to recognizing record uniqueness. We can try to enforce data integrity, but shouldn't assume it, and the GUID key can at least provide fool-proof record uniqueness every time, even if the natural key rules have been violated. – Yarin Jun 16 '10 at 14:46
  • All keys are "fool-proof" if they are enforced with a uniqueness constraint. I can't agree that we should not assume data integrity! The first responsibility of the database designer is to create the correct data model - one that ensures the relevant facts about the business are recorded in a way that avoids inaccurate results. Unless and until you implement the natural keys you have not achieved that. A guid will not help you. – nvogel Jun 16 '10 at 15:10
  • @David- I'm generally wary of assuming anything, let alone real-world data integrity- It's impossible to predict how a database will be used/abused by software devs and future designers/admins once the original author cashes in his 401k. Natural keys cannot be fool-proof, hence the appeal of a surrogate key that is arbitrary i.e. not reliant on the data- So guess I'm talking about guaranteeing referential integrity, instead of assuming data integrity – Yarin Jun 16 '10 at 15:37
  • "It's impossible to predict how a database will be used/abused by software devs and future designers/admins". Including removing the GUID column. Referential integrity is largely pointless, if there's no integrity to the data. The point of creating unique key constraints/indexes on natural keys is that the DBMS will enforce that uniqueness for the records identified by that key - you're not **assuming** real world data integrity, you're **enforcing** it within the database. –  Jun 16 '10 at 16:18
  • @Yarin, I agree with Mark. You aren't really enforcing referential integrity if you aren't also enforcing the natural keys. By allowing duplicate data you are creating the potential for deletion anomalies which means the referencing row could actually point to the wrong version of the data while the row that should have been referenced gets deleted. Enforcing the keys of interest to the business users ought to come before RI and is generally more important. – nvogel Jun 16 '10 at 18:19
  • @Mark, David, OK I hear you on data integrity. But regardless of how data integrity is enforced, are we doing any damage by adding the surrogate key? – Yarin Jun 16 '10 at 19:20
  • @Yarin, if you do it every time you are possibly adding complexity and some performance overhead where it isn't really needed. For instance you are probably creating more joins in queries than would otherwise be the case (because you have to join to retrieve the meaningful attribute from the other table). Also if surrogates become an excuse not normalize your database then you are potentially creating redundancy, which is obviously bad for all the usual reasons. – nvogel Jun 16 '10 at 19:29
  • @Yarin: What David said. An additional thought occurs to me, though: if you're implementing an Object/Relational Mapping (ie. using an RDBMS as a bucket to store object-oriented data), the usual relational data integrity rules just don't apply. –  Jun 17 '10 at 11:33
  • @Yarin. I don't think you understand the distinction: GUIDs or any Surrogate Id, merely provide uniqueness to the row, it does not in any way ensure that the row is not a duplicate; it happily makes non-unique rows unique; it is a false sense of security. That is not **enforced** uniqueness, that is blindly overlooking the uniqueness problem. Unique indices on the natural key columns (whether you have surrogates or not) enforce uniqueness. – PerformanceDBA Dec 25 '10 at 08:25
3

Simple answer: it's not relational.

The record (as defined by the GUID) may be unique, but none of the associated attributes can be said to be occuring uniquely with that record.

Using a GUID (or any purely surrogate key) is no more relational than declaring a flat file to be relational, on the basis that each record can be identified by its row number.

1

A potentially big reason, but one often not thought of, is if you might have to provide compatibility with an Oracle database in the future.

Since Oracle doesn't have a uniqueid column data type, it can lead to a bit of a nightmare when you have two different data types for the same primary key across two different databases, especially when an ORM is involved.

Coxy
  • 8,844
  • 4
  • 39
  • 62
  • Interesting- So Oracle uses ints for all primary keys? – Yarin Oct 13 '11 at 16:15
  • I'm sure Oracle can use anything you like as a primary key... it's just that they don't have that special type to represent a GUID natively. – Coxy Oct 14 '11 at 00:25
1

I wonder why there's no standard "miniGUID" type? It would seem that performing a decent hash on a GUID should yield a 64-bit number which would have a trivial probability of duplication in any universe which doesn't have a billion or more things in it. Since the universe in which most GUID/miniGUID identifiers are used will never grow beyond a million things, much less a billion, I would think a smaller 8-byte miniGuid would be very useful.

That would not, of course, suggest that it should be used as a clustered index; that would greatly impede performance. Nonetheless, an 8-byte miniGUID would only waste a third the space of a full GUID (when compared to a 4-byte index).

supercat
  • 77,689
  • 9
  • 166
  • 211
0

I can see the case for a given application's or enterprise's own identifiers to be unique and be represented in a consistent way across all its own domains (i.e. because they may span more than one database) but a GUID is overkill for these purposes. I guess they are popular because they are available out of the box and designing and implementing an 'enterprise key' takes time and effort. The rule when designing an artifical identifier is to make it as simple as possible but no simpler. IDENTITY is too simple, a GUID isn't simple enough.

Entities that exist outside of the application/enterprise usually have their own identifiers (e.g. a car has a VIN, a book has an ISBN, etc) maintained by an external trusted source and in such cases the GUID adds nothing. So I guess the philosphical argument against I'm getting at here is that using a artifical identifier on every table is unnecessary.

onedaywhen
  • 55,269
  • 12
  • 100
  • 138
  • 1
    I would caution against relying on external identifiers. The first time you have to enter that recovered stolen vehicle with the VIN's filed of, or that self published "book" in your collection that does not have an ISBN, you will be heading down the slippery slope of "making up your own VIN, ISBN, ..." entries. – Peter Stuer Jun 16 '10 at 09:01
  • If a self published "book" has a different identifier as a published book book then they wouldn't exist in the same base table (perhaps combined in a `VIEW`). – onedaywhen Jun 16 '10 at 09:43
  • @Peter Stuer: are there cases for artificial identifiers? Yes. Does every table need an artificial identifier? No. – onedaywhen Jun 16 '10 at 09:45
  • 1
    I'm down voting this (Nvmd I can't, not enough rep) because you're making assumptions about domain integrity that don't apply in the real world. Assuming that a record will only be referenced within its current domain is not practical. – Yarin Jun 16 '10 at 14:52
  • @Yarin: I think you miss my point: if an entity is to be identified outside the application/entity then EITHER it will already have an universal identifier of the UPC/EAN/ISBN/VIN variety maintained by a trusted source and therefore a GUID is redundant OR your application is to be the trusted source and therefore a GUID is not the optimum format for a key. – onedaywhen Jun 18 '10 at 08:39
  • @Yarin: "Assuming that a record will only be referenced within its current domain" -- in DBMS terms, 'record' and 'reference' have very specific meanings which I assume (hope!) you didn't intend to imply. – onedaywhen Jun 18 '10 at 08:40
  • @onedaywhen: I get what you're saying now- good point, chewing on it – Yarin Jun 19 '10 at 14:32
  • @Yarin: If one were to create 64-bit random mini-GUID values by hashing GUIDs, how likely is it that there would ever be a conflict in a domain which is expected to have no more than e.g. a few hundred thousand records? Even if the domain grows a thousand-fold beyond what's expected a duplicate would be unlikely (if the domain grows too big, the Birthday Principle says duplication would become likely, but a lot of domains would never grow anywhere near large enough for that to be a danger). – supercat Sep 06 '10 at 22:25