65

I'm trying to figure out the "best practices" for deciding whether or not to add an auto-incrementing integer as the primary key to a table.

Let's say I have a table containing data about the chemical elements. The atomic number of each element is unique and will never change. So rather than using an auto-incrementing integer for each column, it would probably make more sense to just use the atomic number, correct?

Would the same be true if I had a table of books? Should I use the ISBN or an auto-incrementing integer for the primary key? Or a table of employees containing each person's SSN?

jamieb
  • 9,847
  • 14
  • 48
  • 63
  • 1
    This question, in various forms, is somewhat perennial... see http://stackoverflow.com/questions/532363/native-primary-key-or-auto-generated-one for example. – mjv Feb 02 '10 at 17:44
  • This is not really worth an answer but here's my opinion: if you're absolutely sure you won't ever have two books with the same ISBN or two persons with the same SSN, I wouldn't hesitate using those values as primary keys. But that's only a habit I guess. Some ORM systems like Django's models make it vary hard to do that and insist on always having a numeric incremental ID. On the other hand, if you're on PostgreSQL, you can even do things like *dual primary keys*. I like to use them whenever I can. – Attila O. Feb 02 '10 at 17:49
  • possible duplicate of [Surrogate vs. natural/business keys](http://stackoverflow.com/questions/63090/surrogate-vs-natural-business-keys) – Warren Dew May 04 '14 at 01:04

8 Answers8

16

There are a lot of already addressed questions on Stack Overflow that can help you with your questions. See here, here, here and here.

The term you should be looking for: surrogated keys.

Hope it helps.

Community
  • 1
  • 1
Pablo Santa Cruz
  • 176,835
  • 32
  • 241
  • 292
  • 1
    A `surrogate key` is a key which does not have any contextual or business meaning. It is manufactured “artificially” and only for the purposes of data analysis. The most frequently used version of a surrogate key is an increasing sequential integer or “counter” value (i.e. 1, 2, 3). White a `natural key` is a key that has contextual or business meaning (for example, in a table containing STORE, SALES, and DATE, we might use the DATE field as a natural key when joining with another table detailing inventory). – carloswm85 Sep 07 '21 at 17:01
13

This is a highly debated question, with lots of emotion on both sides.

In my humble opinion, if there's a good, useable natural key available -- like an ISBN -- I use it. I'm going to store it on the database anyway. Yes, a natural key is usually bigger than an integer auto-increment key, but I think this issue is overblown. Disk space is cheap today. I'd worry more about it taking longer to process. If you were talking about an 80 byte text field as a primary key, I'd say no. But if you're thinking of using a 10-byte ISBN instead of an 8-byte big integer, I can't imagine that brings much of a performance penalty.

Sometimes there's a performance advantage to natural keys. Suppose, for example, I want to find how many copies of a given book have been sold. I don't care about any of the data from the Book master record. If the primary key is ISBN, I could simply write "select count(*) from sale where isbn='143573338X'". If I used an autoincrement key, I would have to do a join to look up the isbn, and the query becomes more complex and slower, like "select count(*) from book join sale using (bookid) where isbn='143573338X'". (And I can assure you that that as this particular ISBN is for my book, the number of sale records is very small, so doing the join and reading one extra record is a big percentage difference!)

Another advantage of natural keys is that when you have to work on the database and you look at records that refer back to this table by key, it's easy to see what record they're referring to.

On the other hand, if there is no good, obvious natural key, don't try to cobble together a crazy one. I've seen people try to make a natural key by concatenating together the first 6 letters of the customers first name, his year of birth, and his zip code, and then pray that that will be unique. That sort of silliness is just making trouble for yourself. Often people end up taking on a sequence number to insure it's unique anyway, and at that point, why bother? Why not just use the sequence number by itself as the key?

Jay
  • 26,876
  • 10
  • 61
  • 112
5

You've got the idea right there.

Auto-increment should be used as a unique key when no unique key already exists about the items you are modelling. So for Elements you could use the Atomic Number or Books the ISBN number.

But if people are posting messages on a message board then these need a unique ID, but don't contain one naturally so we assign the next number from a list.

It make sense to use natural keys where possible, just remember to make the field as the primary key and ensure that it is indexed for performance

Craig Warren
  • 1,655
  • 4
  • 23
  • 38
  • 2
    "Auto-increment should be used as a unique key when no unique key already exists" -- I couldn't disagree more. – onedaywhen Dec 14 '11 at 09:35
  • @onedaywhen Why is that? Are you suggesting you should _always_ an auto-increment key? – MrWhite Oct 24 '18 at 16:53
  • @MrWhite: My suggestion would be to never use an auto-increment column. But the point I am making in the above comment is that if your data lacks a key then adding an auto-increment is almost never the solution. Sometimes the key actually exists in the data but it is undesirable because it is a compound and/or temporal. Sometimes you will want to fabricate a key and expose it in your data model but auto-increment is rarely a good fit. Sure, atomic number is a sequence number but it benefits from having a trusted source. ISBN contains a checksum, another good property of a key. – onedaywhen Oct 26 '18 at 11:24
4

With regards to using ISBN and SSN you really have to Think about how many rows in other tables are going to reference these through foreign keys because those ids will take up much more space than an integer and thus may lead to a waste of disk space and possibly to worse join performance.

Klaus Byskov Pedersen
  • 117,245
  • 29
  • 183
  • 222
  • "will take up much more space than an integer and thus may lead to a waste of disk space" -- greetings from the year 2012 (well, almost!): I am here to tell you that disk space is now as cheap as chips :) – onedaywhen Dec 14 '11 at 09:45
  • 1
    @onedaywhen join performance is still something to consider in the year 2012 though :-P – Klaus Byskov Pedersen Dec 14 '11 at 10:04
  • If you store the ISBN on the referencing tables then you don't need any joins to get ISBN (slower inserts). If you only ever store it in one table then you will always need to join back to that one table to get the ISBN (slower queries). What happen most often in the books database: a book is added or a book turns in up in a search etc? P.S. how's the performance in 2018? :) – onedaywhen Oct 26 '18 at 11:36
3

The main problem that I have seen with the auto incrementing an integer approach is when you export your data to bring into another db instance, or even an archive and restore operation. Because the integer has no relation to the data that it references, there is no way to determine if you have duplicates when restoring or adding data to an existing database. If you want no relationship between the data contained in the row and the PK, I would just use a guid. Not very user friendly to look at, but it solves the above problem.

Rob Goodwin
  • 2,676
  • 2
  • 27
  • 47
2

I'm trying to figure out the "best practices" for deciding whether or not to add an auto-incrementing integer as the primary key to a table.

Use it as a unique identifier with a dataset where the PKey is not part of user managed data.

Let's say I have a table containing data about the chemical elements. The atomic number of each element is unique and will never change. So rather than using an auto-incrementing integer for each column, it would probably make more sense to just use the atomic number, correct?

Yes.

Would the same be true if I had a table of books? Should I use the ISBN or an auto-incrementing integer for the primary key? Or a table of employees containing each person's SSN?

ISBNs/SS#s are assigned by third-parties and because of their large storage size would be a highly inefficient way to uniquely identify a row. Remember, PKeys are useful when you join tables. Why use a large data format like an ISBN which would be numerous textual characters as the Unique identifier when a small and compact format like Integer is available?

Keith Adler
  • 20,880
  • 28
  • 119
  • 189
  • "Let's say I have a table containing data about the chemical elements... it would probably make more sense to just use the atomic number" -- note there are three candidate keys: atomic weight, symbol and number. Should all have unique constraints in the database table? Is it worth picking one to be the primary key? If so, based on what criteria? p.s. there is no 'correct' answer to these questions :) – onedaywhen Dec 14 '11 at 09:42
  • Is it true that a `CHAR(13)` value is "large" and "highly inefficient"? – onedaywhen Dec 14 '11 at 09:43
0

Old topic I know, but one other thing to consider is that given that most RDBMSes lay out blocks on disk using the PK, using an auto-incrementing PK will simply massively increase your contention. This may not be an issue for your baby database you're mucking around with, but believe me it can cause massive performance issues at the bigger end of town.

If you must use an auto-incrementing ID, maybe consider using it as part of a PK. Tack it on the end to maintain uniqueness.....

Also, it is best to exhaust all possibilities for natural PKs before jumping to a surrogate. People are generally lazy with this.

Simon
  • 1
  • "consider using it as part of a PK" - How is using "it as part of a PK" better than using it solely as the PK? "it is best to exhaust all possibilities for natural PKs before jumping to a surrogate" - can you explain why? This would seem to fly in the face of popular opinion, such as https://stackoverflow.com/questions/590442/deciding-between-an-artificial-primary-key-and-a-natural-key-for-a-products-table – MrWhite Oct 24 '18 at 17:27
0

One other thing I haven't seen mentioned is you're potentially exposing statistics you may not want to. For example, when this question was written, it was (give or take) the 2,186,260th one posted on Stack Overflow, which I can tell by literally just looking at it's ID in the URL.

Krusty the Clown
  • 517
  • 6
  • 24