97

In a fairly animated discussion in my team I was made to think what most people like as primary keys. We had the following groups-

  1. Int/ BigInt which autoincrement are good enough primary keys.
  2. There should be at least 3 columns that make up the primary key.
  3. Id, GUID and human readable row identifiers all should be treated differently.

What's the best approach for PKs? It would be awesome if you could justify your opinion. Is there a better approach that the above?

EDIT: Anyone has a simple sample/algorithm to generate human readable identifiers for rows that scales well?

Wolf
  • 9,679
  • 7
  • 62
  • 108
Perpetualcoder
  • 13,501
  • 9
  • 64
  • 99
  • 2
    "There should be at least 3 columns that make up the primary key"? What does this mean? Can you provide further definition? Or is this part of #3? – S.Lott Jan 01 '09 at 00:42
  • 1
    @S.Lott `PK(NEWID(),NEWID(),NEWID())` ;-) –  Dec 15 '10 at 19:43
  • @pst: Why is this a requirement? Why must there be three columns in a PK? Why one one or four? – S.Lott Dec 16 '10 at 03:11
  • I could see a three column PK look like... LocalID (Auto increment int), GlobalID (GUID), ForeignId (foreign key like RolesType), etc. The LocalID+ForiegnId could be a compound key combination. The Guid is used for other website/services. Personally I wouldn't do this, I would just use Guid + ForiegnId. – Jerad Jul 17 '13 at 19:13

26 Answers26

94

If you're going to be doing any syncing between databases with occasionally connected apps, then you should be using GUIDs for your primary keys. It is kind of a pain for debugging, so apart from that case I tend to stick to ints that autoincrement.

Autoincrement ints should be your default, and not using them should be justified.

Bramha Ghosh
  • 6,504
  • 4
  • 30
  • 29
  • 2
    This really is the right answer, GUID = Global Unique IDentifier. This is what you want from a PK isn't it? Globally identify a piece of data? – ReaperUnreal Dec 31 '08 at 21:46
  • 3
    A GUID isn't necessary, just change the step to 10 or 20 or however many server's you'll need to possibly sync with in the future. – Robert C. Barth Dec 31 '08 at 22:06
  • GUIDs hose your index fill factor if you use them as your clustering index. Should use a GUID as an AK really. – stephbu Jan 01 '09 at 01:05
  • 49
    90% of the time at least, a GUID is not needed and wastes space. – Jonathan Leffler Jan 01 '09 at 03:10
  • And, unless you're really careful, can hose index fragmentation as well, since the most common algorithms used to create them, puts the "random" portion of the guid in the most significant bit positions... This increases the requirement for index defragmentation / Reindexing – Charles Bretana Jan 01 '09 at 20:11
  • 8
    I seriously feel GUIDs is an overkill. Never ever had a need to have GUIDs as my primary keys yet. – Cyril Gupta Jan 03 '09 at 04:10
  • @bramhaghosh yeah, guids suck... thanks. – John Boker Feb 05 '09 at 21:53
  • 2
    @stephbu: the SQL Server function newsequentialid() solves the index fragmentation problem, though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness. See http://msdn.microsoft.com/en-us/library/ms189786.aspx. – ErikE Feb 02 '10 at 17:51
  • oops, I meant to say 16 bytes – ErikE Feb 02 '10 at 18:04
  • 7
    Or, instead of wasting space and risking collision with a GUID, make a composite key of the original primary key and a small identifier, where the small identifier is different for each sync source. – L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳ Feb 07 '10 at 22:25
  • 2
    The reason to use guids is to not depend on the database to generate them. I do not ever want my database to be in charge of generating my ids. – Chris Marisic May 04 '11 at 17:37
  • 6
    A shop I worked for used GUIDs for everything, even when public identifiers were available, like ISO country or language codes. And even when a boolean or `CHAR(1)` would have been enough, like for `sex`. Needless to say, it was a nightmare to work with. – Lumi May 23 '11 at 18:11
  • 2
    Using GUIDS as a PK adds a lot of overhead: http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx – Michael Paulukonis Dec 15 '11 at 21:36
60

I don't see an answer which points out (what I regard as) the really fundamental point - namely, that a primary key is what guarantees that you won't get two entries in the table for the same real-world entity (as modelled in the database). This observation helps establish what are good and what are bad choices for primary key.

For example, in a table of (US) state names and codes, either the name or the code could be the primary key - they constitute two different candidate keys, and one of them (normally the shorter - the code) is chosen as the primary key. In the theory of functional dependencies (and join dependencies - 1NF through 5NF - it is the candidate keys that are crucial rather than a primary key.

For a counter-example, human names generally make a bad choice for primary key. There are many people who go by the name "John Smith" or some other similar names; even taking middle names into account (remember: not everyone has one - for example, I don't), there is plenty of scope for duplication. Consequently, people do not use names as primary keys. They invent artificial keys such as the Social Security Number (SSN) or Employee Number and use them to designate the individual.

An ideal primary key is short, unique, memorable, and natural. Of these characteristics, uniqueness is mandatory; the rest have to flex given the constraints of real world data.

When it comes to determining the primary key of a given table, therefore, you have to look at what that table represents. What set or sets of column values in the table uniquely identifies each row in the table? Those are the candidate keys. Now, if each candidate key consists of 4 or 5 columns, then you might decide that those are too clumsy to make a good primary key (primarily on grounds of shortness). In those circumstances, you might introduce a surrogate key - an artificially generated number. Very often (but not always) a simple 32-bit integer is sufficient for the surrogate key. You then designate this surrogate key as the primary key.

However, you must still ensure that the other candidate keys (for the surrogate key is a candidate key too, as well as the chosen primary key) are all maintained as unique identifier - normally by placing a unique constraint on those sets of columns.

Sometimes, people find it difficult to identify what makes a row unique, but there should be something to do that, because simply repeating a piece of information doesn't make it any more true. And if you're not careful and do get two (or more) rows purporting to store the same information, and you then need to update the information, there is a danger (especially if you use cursors) that you will update just one row rather than every row, so the rows are out of synchrony and no-one knows which row contains the correct information.

This is a pretty hard-line view, in some respects.

I've no particular problem with using a GUID when they are needed, but they tend to be big (as in 16-64 bytes), and they are used too often. Very often a perfectly good 4-byte value would suffice. Using a GUID where a 4-byte value would suffice wastes disk space, and slows up even indexed access to the data since there are fewer values per index page, so the index will be deeper and more pages have to be read to get to the information.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • 10
    Regarding your sample with US state names I would prefer a separate surrogate key, simply because the codes are something beyond your control. If they should change for whatever reason you get a problem. – Dirk Vollmar Jan 14 '09 at 23:43
  • 2
    (continued) For example, Germany replaced a 4-digit zip code system with a 5-digit system back in the 1990ies after the re-unification. – Dirk Vollmar Jan 14 '09 at 23:44
  • @divo: I am a strong advocate of artificial/surrogate keys, but even I don't see the 4-digit to 5-digit postal code change as being a good example. Postal codes are not generally used as keys to anything. (When's the last time you had to query a PostalCode table to find out something about that code? No, it's almost exclusively used as part of an address without being referenced in any other tables. I would say your suggestion is almost on par with using surrogate keys for addresses themselves.) – ErikE Feb 02 '10 at 17:55
  • 1
    @Emtucifor: Yes, maybe ZIP is not a very practical example, but my point was that if part of your surrogate key is out of your control and changes for whatever reason, you are in trouble. Think of someone creating a new social security number scheme, a new ISSN scheme or - maybe more realistic - a company deciding to create a new product id system after a merge, assigning new employee numbers to their employees to adjust their growth etc. These are all just fictional examples, but, as my previous example with the ZIP shows, sometimes a well-established system might change. – Dirk Vollmar Feb 02 '10 at 19:18
  • @divo: Agreed that Social Security Number is a good example. – ErikE Feb 03 '10 at 00:40
  • 2
    Your first point is correct. There is a name for this constraint. It's called "entity integrity". EI requires that every entity have a unique identity. Primary keys often meet this requirement, except when autonumber is used. With autonumber, you can get two rows that are identical, except for the autonumber. This usually violates entity integrity. – Walter Mitty Mar 12 '11 at 11:56
  • ZIP code was a valid example... I've done major projects involving data USA small parcel shipping e.g. time in transit, cost modeling, service level auditing, and it is ALL based on ZIP code... injection zip code (where the package enters the system) destination zip code (where it should go), and delivery zip code (where it went)... just to name a few. Get a rate on shipping - lookup via zip. Want time in transit? Lookup by zip. Ask a shipping manager how she wants to view carrier performance? - by 3-digit zip code. In data modeling, never make assumptions about domains you don't know. – GWR Jun 07 '17 at 19:15
28

This is only a religious issue because people seek a universal right answer. The fact that both your team and this SO thread shows so much disagreement should be a clue that there are good reasons to use all the solutions you describe, in different circumstances.

  • Surrogate keys are useful when no other attribute or set of attributes in the table is suitable to identify rows uniquely.
  • Natural keys are preferred, when possible, to make the table more human-readable. Natural keys also allow the foreign key in a dependent table to contain a real value instead of a surrogate id. E.g. when you need to store state (CA, TX, NY) you might as well use a char(2) natural key instead of an int.
  • Use compound primary keys where appropriate. Do not add an "id" surrogate key unnecessarily when a perfectly good compound key exists (this is especially true in many-to-many tables). A mandate for a three-column key in every table is absolute nonsense.
  • GUIDs are a solution when you need to preserve uniqueness over multiple sites. They are also handy if you need values in the primary key to be unique, but not ordered or consecutive.
  • INT vs. BIGINT: it's not common that a table requires a 64-bit range for primary keys, but with the increasing availability of 64-bit hardware it shouldn't be a burden, and gives more assurance that you won't overflow. INT is of course smaller, so if space is at a premium it can give a slight advantage.
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 9
    I disagree as much as a person can possibly do so. Natural keys are horrible. What if one wants to change the data? Oh, you can't. Writing joins on composite natural keys is a pain. Carrying that composite key to all your related tables is a waste. – Robert C. Barth Dec 31 '08 at 22:23
  • 2
    @Robert: read about "ON UPDATE CASCADE". But I get what you are saying, and I agree it's best to use a surrogate key most of the time, because attributes are subject to change and to be non-unique. – Bill Karwin Jan 01 '09 at 00:43
  • 2
    Primary keys should be immutable. Cascade updates are only a ugly hack for a bad design decision in this case. Natural keys are NEVER preferred. Same to composite keys, that spread themselves like a plague. Anyone with more than 3 months of database development experience would know this. – F.D.Castel Jan 05 '09 at 06:28
  • 7
    @F.D.: I don't agree with your unequivocal statement, and I've been developing with SQL databases since 1992. But certainly it's true that surrogate keys are best able to remain immutable. – Bill Karwin Jan 05 '09 at 18:59
21

I like The Database Programmer blog as a source for this kind of info.

3 columns for a primary key? I would say that columns should have appropriate unique constraints as the business rules demand, but I'd still have a separate surrogate key. Compound keys mean business logic enters into the key. If the logic changes, your whole schema is screwed.

duffymo
  • 305,152
  • 44
  • 369
  • 561
  • 2
    They changed their link, here is the updated bookmark: http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html – Astra Dec 31 '08 at 21:29
  • Just inherited a project like this. And the very first thing they wanted to do blew up the schema. Surrogate Keys FTW. Business Logic in your DB FTL. – Jason Dec 15 '10 at 21:08
17

I like mine unique.

Andrew not the Saint
  • 2,496
  • 2
  • 17
  • 22
12

Slightly off-topic, but I feel compelled to chime in with...

If your primary key is a GUID, do not make it a clustered index. Since GUIDs are non-sequential, the data will be re-arranged on disk during almost every insert. (Yuck.) If using GUIDs as primary keys, they should be nonclustered indexes.

Portman
  • 31,785
  • 25
  • 82
  • 101
  • 1
    Very good point - one needs to distinguish between the *LOGICAL* concept of a primary key (might be valid to use a GUID for that, especially if replication is involved), and the *PHYSICAL* concept of the clustering key - that should *NEVER* be a GUID since it leads to excessive index fragmentation – marc_s Jan 07 '09 at 06:39
  • 3
    This is in fact not accurate. The data will be inserted in order, which given the GUID's random nature could end up being anywhere across the table. On the off chance that there isn't room, a page split will happen, but certainly not "re-arranging on disk during every insert" not even close. – Ralph Shillington Oct 16 '09 at 23:23
  • @Ralph, you're right, not EVERY insert, but enough to cause a 20x performance hit. http://www.sql-server-performance.com/articles/per/guid_performance_p1.aspx – Portman Oct 17 '09 at 02:45
  • 1
    The SQL Server function newsequentialid() solves the index fragmentation problem with GUIDs (though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness). See msdn.microsoft.com/en-us/library/ms189786.aspx. – ErikE Feb 02 '10 at 18:00
  • You should not run a GUID clustered index with a Fill Factor or 100% for the pages anyway. – r3mark Jul 08 '22 at 03:23
11

I think the use of the word "Primary", in the phrase "Primary" Key is in a real sense, misleading.

First, use the definition that a "key" is an attribute or set of attributes that must be unique within the table,

Then, having any key serves several often mutually inconsistent purposes.

  1. To use as join conditions to one or many records in child tables which have a relationship to this parent table. (Explicitly or implicitly defining a Foreign Key in those child tables)

  2. (related) Ensuring that child records must have a parent record in the parent table (The child table FK must exist as Key in the parent table)

  3. To increase performance of queries that need to rapidly locate a specific record/row in the table.

  4. To ensure data consistency by preventing duplicate rows which represent the same logical entity from being inserted into the table. (This is often called a "natural" key, and should consist of table (entity) attributes which are relatively invariant.)

Clearly, any non-meaningful, non-natural key (like a GUID or an auto-generated integer is totally incapable of satisfying #4.

But often, with many (most) tables, a totally natural key which can provide #4 will often consist of multiple attributes and be excessively wide, or so wide that using it for purposes #1, #2, or #3 will cause unacceptable performance consequences.

The answer is simple. Use both. Use a simple auto-Generating integral key for all Joins and FKs in other child tables, but ensure that every table that requires data consistency (very few tables don't) have an alternate natural unique key that will prevent inserts of inconsistent data rows... Plus, if you always have both, then all the objections against using a natural key (what if it changes? I have to change every place it is referenced as a FK) become moot, as you are not using it for that... You are only using it in the one table where it is a PK, to avoid inconsistent duplicate data...

As to GUIDs, be very careful using them, as using GUIDs in an index can hose index fragmentation. The most common algorithms used to create them puts the "random" portion of the GUID in the most significant bit positions... This increases the requirement for regular index defragmentation / Reindexing as new rows are added.

Charles Bretana
  • 143,358
  • 22
  • 150
  • 216
  • The SQL Server function newsequentialid() solves the index fragmentation problem of GUIDs (though 24 bytes is still a tad excessive if you don't absolutely need global uniqueness). See msdn.microsoft.com/en-us/library/ms189786.aspx. – ErikE Feb 02 '10 at 17:58
  • oops, I meant to say 16 bytes. – ErikE Feb 02 '10 at 18:04
10

I always go with the surrogate key. A surrogate key (usually an identity column, autoincrement, or GUID) is one in which the key is not present in the data itself. A natural key, on the other hand, is one that, on its own, uniquely identifies the row. As near as I can tell in life, there are hardly any real natural keys. Not even things like SSN in the United States is a natural key. Composite primary keys are a disaster waiting to happen. You can't edit any of that data (which is the major drawback of any natural key, composite or not), but worse is that with a composite key, now you have to perpetuate that key data into every related table. What a giant waste.

Now, for selection of the surrogate key, I stick with identity columns (I work mostly in MS SQL Server). GUID's are too large and Microsoft recommends against using them as a PK. If you have multiple servers, all you need to do is make the increment 10 or 20 or whatever you think the maximum number of servers you'll ever need to sync/expand to, and just inc the seed for each table on each subsequent server, and you'll never have a data collision.

Of course, because of the increment, I make the identity column a BigInt (otherwise known as a long [64 bits]).

Doing a bit of math, even if you make the increment 100, you can still have 92,233,720,368,547,758 (> 92 quadrillion) rows in your table.

Robert C. Barth
  • 22,687
  • 6
  • 45
  • 52
9

One thing you should never do is use a smart key. That is a key where information about the record is coded in the key itself, and it will eventually bite you.

I worked one place, where the primary key was the account ID, which was a combination of letters and numbers. I don't remember any specifics, but, for example, those accounts that were of a certain type, would be in the 600 range, and of another type, started with 400. That was great, until that customer decided to ask for both types of work. Or changed the type of work they did.

Another place, used the location in the tree as the primary key for records. So there would be records like the following.

Cat1.subcatA.record1
Cat1.subcatA.record2
Cat1.subcatB.record1
Cat2.subcatA.record1

Of course, the first thing the customers wanted was a way to move items in the tree around. The entire set of software died before that happened.

Please, please, please, if you're writing code that I ever have to maintain, please don't use a smart key!

thursdaysgeek
  • 7,696
  • 20
  • 78
  • 115
5

I'm a fan of the auto-increment as primary key. I know deep in my heart that this is a cop-out, but it does make it so easy to sort data by when it was added (ORDER BY ID DESC, f'r instance).

3 columns sounds awfully harsh to humanly parse.

And that's the trade-off -- how much of the relational capability do you need, versus making THIS TABLE RIGHT HERE understandable to a human interrogating it (versus the stored-procedure or programmatic interface).

auto-increment is for us humans. :-(

Michael Paulukonis
  • 9,020
  • 5
  • 48
  • 68
5

Generally, it depends.

Personally, I like autoincrement ints.

But, one thing I can tell you is to never trust data from other sources as your key. I swear, every time I've done that it comes back to bite me. Well, never again!

BoltBait
  • 11,361
  • 9
  • 58
  • 87
4

There should be atleast 3 columns that make up the primary key.

I don't understand this.

Are you talking about a "natural key", e.g. "name and date of birth"? A natural key might be ideal if it exists, but most candidates for a natural key are either not unique (several people with the same name) or not constant (someone can change their name).

Int/ BigInt which autoincrement are good enough primary keys.

I prefer Guid. A potential problem with autoincrement is that the value (e.g. "order id") is assigned by the database instance (e.g. by the "sales database") ... which won't entirely work (instead you start to need compound keys) if you ever need to merge data created by more than one database instance (e.g. from several sales offices each with their own database).

ChrisW
  • 54,973
  • 13
  • 116
  • 224
  • Primary keys are required to be unique, but are not required to be constant. Hence foreign keys declared with "ON UPDATE CASCADE". But making an assumption that primary keys are constant helps to simplify many applications. This is one benefit of surrogate keys. – Bill Karwin Dec 31 '08 at 21:40
4

RE GUID's

Watch out if this is going to be a really Really REALLY REALLY big database, lots of load, and fast access.

At my last job, where we had databases of 100 to 500 million records, our database guys strongly argued against GUIDs, and for an appropriately sized decimal number. They felt that (under Oracle) the size difference in the internal storage for a string Guid - vs- a decimal value would make a very noticeable difference in lookups. ( Bigger keys = deeper trees to traverse)

The random nature of GUIDs also reduces the fill-factor for index pages significantly - this dramatically increases tearing and disk I/O.

stephbu
  • 5,072
  • 26
  • 42
  • "Reduces the fill-factor"? Not sure what that could mean Fill-factor is a one shot deal, defined as the percent of free space requested at the leaf-level of the index at the time the index is built. GUID values by their random nature distribution across the breadth of the leaf-level on inserts into that free-space that fill-factor provided. – Ralph Shillington Oct 17 '09 at 01:32
  • 1
    Since when is a GUID a string? GUIDs should be stored internally as 16 bytes by any respectable DBMS. Storing as 32 bytes in the hex representation would be unconscionable! (or 36 with dashes, or 38 with curly braces) – ErikE Feb 02 '10 at 18:04
3

This is a classic "it depends". There's no one right answer for every project. I like different things for different situations. It depends on whether I'm using an ORM and what it supports. It depends on the overall architecture (distributed or not, etc). Just pick one that you think will work and move on to arguing over tabs and spaces.

John Sheehan
  • 77,456
  • 30
  • 160
  • 194
3

I tend to use option #1 or #3 depending on the size, the number of people connecting, and whether it is a multiple database server situation or not.

Option #2 doesn't make much sense to me. If any one of the three is not enough to identify a unique record, then it's possible (without going through extra machinations) two have two records show up with the same values in all three columns. If you want to enforce uniqueness on any combination of the three, then just add an index for them.

BIBD
  • 15,107
  • 25
  • 85
  • 137
3

Auto increment columns. I am able to make my code work seamlessly with SQL Server or Oracle, one using identity the other using sequences through my DAL, and I couldn't be happier. I agree, GUIDs sometimes are necessary if you are doing replication or sending data away to receive it later on afer processing.

Otávio Décio
  • 73,752
  • 17
  • 161
  • 228
3

I've always used a surrogate key - an autoincrementing integer called 'id'. I can see plenty of reasons to do this even when another option is obvious:

  • Consistency
  • Data independent (unique, not destroyed by changes to format)
  • Human-readable

...and no sensible reason not to:

  • Ambiguity in joins? - Aliasing tables is a better practice, IMHO
  • Optimum tables? - Removing one byte per entry is premature optimisation, IMHO
  • Per-table decision? - No longer consistent
  • Scaling problems? - Eh? Why?
  • Hierarchical data structure? - That's denormalising, a whole other subject of religion. Suffice it to say I'm a fan in a few circumstances in theory, but never in practice :)

sensible reasons against that I haven't thought of or come across yet are always welcomed...

jTresidder
  • 466
  • 1
  • 3
  • 8
2

Guids.period.

In the event that you need to scale out or you need to assign the primary key by alternate means they will be your friend. You can add indexes for everything else.


update to clarify my statement.

I've worked on a lot of different kinds of sites. From small single server deals to large ones backed with multiple DB and web servers. There have certainly been apps that would have been just fine with auto incrementing ints as primary keys. However, those don't fit the model of how I do things.

When using a GUID you can generate the ID anywhere. It could be generated by a remote server, your web app, within the database itself or even within multiple databases in a multimaster situation.

On the other hand, an auto incremented INT can only be safely generated within the primary database. Again, this might be okay if you have an application that will be intimately tied to that one backing DB server and scaling out is not something you are concerned with.

Sure, usage of GUIDs mean you have to have nightly reindexing processes. However, if you are using anything other than an auto incremented INT you should do that anyway. Heck, even with an INT as the primary it's likely you have other indexes that need regenerated to deal with fragmentation. Therefore, using GUIDs doesn't exactly add another problem because those tasks need to be performed regardless.

If you take a look at the larger apps out there you will notice something important: they all use Base64 encoded GUIDs as the keys. The reason for this is simple, usage of GUIDs enables you to scale out easily whereas there can be a lot of hoops to jump through when attempting to scale out INTs.

Our latest app goes through a period of heavy inserts that lasts for about a month. After that 90+% of the queries are all selects for reporting. To increase capacity I can bring up additional DB servers during this large insert period; and later easily merge those into a single DB for reporting. Attempting to do that with INTs would be an absolute nightmare.

Quite frankly, any time you cluster a database or setup replication the DB server is going to demand that you have GUIDs on the table anyway. So, if you think that your system might need to grow then pick the one that's good.

NotMe
  • 87,343
  • 27
  • 171
  • 245
  • You ever examined the fill factor of your indexes? Random nature of GUID's make 'em swiss cheese - dramatically reducing their effectiveness. – stephbu Jan 01 '09 at 01:08
  • 2
    "Guids.period": That is so wrong. GUIDs should be used where appropriate. As the other commenter pointed out, it might make life as a programmer easy, but affects the overall size and performance of the DB. – Mitch Wheat Jan 01 '09 at 01:33
  • At the end of the day, I can scale my apps out across multiple database servers without issue. But I guess you guys work on small sites. – NotMe Jan 02 '09 at 18:18
  • 3
    GUID *might* be ok for the logical primary key, but NEVER EVER EVER use a GUID column as your CLUSTERING key - you'll be drowning in index fragmentation leading to POOR performance ..... – marc_s Jan 07 '09 at 06:40
  • I certainly wouldn't proclaim "Guids.period." on this topic -- in fact even in an industry so chock full of 'best practices' that sort of statement puts you on shaky ground by default (particularly with that statement). Anything as painful to deal with as a GUID needs some hard justification and as JL says, I think most of us would consider it a last resort. It's as if you posted without reading the rest of the thread. – Hardryv Feb 26 '10 at 22:40
2

I've only use an auto-increment int or a GUID. 99% of the time I've use auto-increment int. It's just what I was taught to use when I first learned about databases and have never run into a reason not to use them (although I know of reasons why a GUID would be better).

I like auto increment ints because it helps with readability. For example I can say "take a look at record 129383" and it's pretty easy for someone to go in and find it. With a GUID that's nearly impossible to do.

dtc
  • 10,136
  • 16
  • 78
  • 104
  • 2
    Why do you say that? It seems many people use an auto-increment integer. It can't be that bad if it works and works well for what you need. – dtc Jan 02 '09 at 21:53
2

Past a basic definitional answer, what constitutes a good primary key is left largely to religion and break room arguments. If you have something that is, and will always, map uniquely to an individual row, then it will work fine as a primary key. Past that point, there are other considerations:

  • Is the primary key definition not overly complex? Does it avoid introducing unnecessary complexity for the sake of following a "best-practice"?
  • Is there a better possible primary key that would require less overhead for the database to handle (i.e. INTEGER vs. VARCHAR, etc)?
  • Am I ABSOLUTELY certain that the uniqueness and defined-ness invariant of my primary key will not change?

This last one is likely what draws most people to use things like GUIDs or self-incrementing integer columns, because relying on things like addresses, phone numbers, first/last names, etc, just don't cut it. The only invariant about people I can think of is SSNs, but then I'm not even 100% certain about those remaining forever unique.

Hopefully this helps add some clarity...

Ed Carrel
  • 4,154
  • 1
  • 25
  • 17
2

The way I approach primary keys (and I feel is the best) is to avoid having a "default" approach. This means instead of just slapping on an auto-incrementing integer and calling it a day I look at the problem and say "is there a column or group of columns that will always be unqiue and won't change?" If the answer is yes then I take that approach.

Andrew G. Johnson
  • 26,603
  • 30
  • 91
  • 135
  • Does that mean you 'avoid auto-incrementing integers whenever you can'? My understanding was that industry experts thought the best performance on large-scale databases comes from minimal-signature, indexed, incremental single-column PKs. – Hardryv Feb 26 '10 at 22:44
  • 1
    I always thought experts used the best tool for the job – Andrew G. Johnson Feb 27 '10 at 20:34
2

Almost always integers.

They have other good reasons besides being smaller/faster to process. Which would you rather write down - "404040" or "3463b5a2-a02b-4fd4-aa0f-1d3c0450026c"?

  • The latter may be an integer, with dashes added and in base 16. But yes, 404040 is faster to process than the long GUID. Then again, 0 is even faster to process because it doesn't require a single bit of data! – strager Jan 01 '09 at 20:26
2

Only slightly relevant, but one thing I've started doing recently when I have small classification tables (essentially those that would represent ENUMs in code) is that I'll make the primary key a char(3) or char(4). Then I make those primary keys representative of the lookup value.

For example, I have a quoting system for our internal Sales Agents. We have "Cost Categories" that every quote line item is assigned one of... So I have a type lookup table called 'tCostCategories', where primary key is 'MTL', 'SVC', 'TRV', 'TAX', 'ODC'. Other columns in the lookup table store more details, such as the normal english meanings of the codes, "Material", "Service", "Travel", "Taxes", "Other Direct Costs", and so forth.

This is really nice because it doesn't use any more space than an int, and when you are looking at the source data, you don't have to link the lookup table to know what the heck the value is. For example, a quote row might look like:

1 PartNumber $40 MTL
2 OtherPartNumber $29.99 SVC
3 PartNumber2 $150 TRV

It's much easier that using an int to represent the categories and then linking 1, 2, 3 on all the lines - you have the data right there in front of you, and the performance doesn't seem affected at all (not that I've truly tested.)

As far as the real question goes... I like RowGUID uniqueidentifiers. I'm not 100% on this, but don't all rows have internal RowGuid's anyway?? If so, then using the RowGuid would actually take less space than ints (or anything else for that matter.) All I know is that if it's good enough for M$ to use in GreatPlains then it's good enough for me. (Should I duck??)

Michael Bray
  • 14,998
  • 7
  • 42
  • 68
2

Oh one more reason I use GUIDs - I use a hierarchical data structure. That is, I have a table 'Company' and a table 'Vendor' for which the Primary Keys match up. But I also have a table 'Manufacturer' that also 'inherits' from Company. The fields that are common to Vendors and Manufacturers don't appear in those tables - they appear in Company. In this setup, using int's is much more painful than Guids. In the very least, you can't use identity primary keys.

Michael Bray
  • 14,998
  • 7
  • 42
  • 68
  • 1
    Yes you can, you just don't make the subtype tables have the identity property, instead they get explicit inserts of the supertype table value. Please see http://stackoverflow.com/questions/2112882/designing-a-conditional-database-relationship-in-sql-server/2113129#2113129 – ErikE Feb 02 '10 at 18:08
2

I like natural keys, whenever I can trust them. I'm willing to pay a small performance price price in order to use keys that make sense to the subject matter experts.

For tables that describe entities, there should be a simple natural key that identifies individual instances the same way the subject matter people do. If the subject matter does not have trustworthy identifiers for one of the entities, then I'll resort to a surrogate key.

For tables that describe relationships, I use a compound key, where each component references an entity that participates in the relationship, and therefore a row in an entity table. Again, the performance hit for using a compound key is generally minimal.

As others have pointed out, the term "primary key" is a little misleading. In the Relational Data Model, the term that's used is "candidate keys". There could be several candidate keys for a single table. Logically, each one is just as good as another. Choosing one of them as "primary" and making all references via that key is simply a choice the designer can make.

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58
  • Please describe some sample trustworthy natural keys? – ErikE Feb 02 '10 at 18:09
  • 1
    "trustworthy" is not a property of a key by itself. Rather, it has to do with the key in the context of the people who supply the data. If you are writing an app to be sold to somebody who will actually be managing the data, you have to guess which keys will be trustworthy to the client or not. Given the variety of clients, you're almost surely going to guess wrong for some fraction of your clientele. – Walter Mitty Feb 03 '10 at 14:23
  • Having said the above, here's an example of a key that we trusted way back when. We had a database about courses. It included textbooks and other course materials about courses, scheduled course offerings, instructors who were qualified to teach courses, course prerequisites, tuition, and so on. When course development created a new course, one of the first things they did was to assign a course code. They were responsible for making sure that course codes were unique, and that courses never changed their code, once assigned. It was part of the data as given to us. – Walter Mitty Feb 03 '10 at 14:30
  • Another good example of trusted natural key is VIN (Vehicle Identification Number). For the last many years, every vehicle sold as new has a VIN attached to it. They can be trusted to be unique and unchanging. – Walter Mitty Feb 03 '10 at 14:32
0

This is a complex subject whether you realized it or not. Might fall under the section on this StackOverflow FAQ.

What kind of questions should I not ask here?

Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!

This has been debated for years and will continue to be debated for years. The only hints of consensus I have seen is that the answers are somewhat predictable depending on if you are asking a OO guy (GUIDs are the only way to go!), a data modeler (Natural keys are the only way to go!), or a performance oriented DBA (INTs are the only way to go!).

Shane Delmore
  • 1,575
  • 2
  • 13
  • 19
  • I will not let the discussion go to long. I was just curious to see the general consensus. – Perpetualcoder Dec 31 '08 at 21:34
  • 1
    I say ask whatever questions you wish! Else, this community will become static and overcontrolled like wikipedia seems to have become. Seems to me like some times you need to let people ask whatever do choose to ask. Trust them, and they might come to trust themselves! – Nicholas Leonard Jan 04 '09 at 23:16