63

The team I'm working with decided to create a table with a varchar primary key. This table is referenced by another table on this primary key.

I've the habit to create an integer primary key, following what I learnt at university. I've read that there is a performance boost using integer primary key.

The matter is that I don't know any other reason for creating an integer primary key. Do you have any tips?

Veedrac
  • 58,273
  • 15
  • 112
  • 169
frabiacca
  • 1,402
  • 2
  • 16
  • 32
  • Can you give a bit more information about the specific table in question? What sort of data is stored in this table, and which field is the primary key? – Mark Byers Jul 01 '10 at 23:02
  • 5
    Did you ask why your peer(s) decided to create a table with a primary key of type VARCHAR? – J. Polfer Jul 01 '10 at 23:07
  • @Mark: i didn't give you guys much more information about tables, 'cause it was a general question on primary key varchar/int. I can tell u just that i'm talking about a table that is referenced by many tables on the db. – frabiacca Jul 04 '10 at 12:40
  • This is a very popular question. You can find some discussions at the links below. http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/d3227460-81c3-48eb-9ad8-c9a1fcc5688c http://www.sqlservercentral.com/Forums/Topic929546-149-1.aspx http://stackoverflow.com/questions/144109/sql-server-normalization-tactic-varchar-vs-int-identity http://weblogs.sqlteam.com/brettk/archive/2004/06/09/1530.aspx – Garett Jul 01 '10 at 23:09
  • 6
    Why is this question closed as 'Not constructive'?! I find it very constructive... – Zeth Oct 14 '17 at 16:40
  • 1
    @Zeth Vote to reopen – Smart Manoj Dec 16 '20 at 05:26
  • [Should I design a table with a primary key of varchar or int?](https://stackoverflow.com/q/1301165/5033247) should be closed in this type – Smart Manoj Dec 16 '20 at 05:28

3 Answers3

52

The primary key is supposed to represent the identity for the row and should not change over time.

I assume that the varchar is some sort of natural key - such as the name of the entity, an email address, or a serial number. If you use a natural key then it can sometimes happen that the key needs to change because for example:

  • The data was incorrectly entered and needs to be fixed.
  • The user changes their name or email address.
  • The management suddenly decide that all customer reference numbers must be changed to another format for reasons that seem completely illogical to you, but they insist on making the change even after you explain the problems it will cause you.
  • Maybe even a country or state decides to change the spelling of its name - very unlikely, but not impossible.

By using a surrogate key you avoid problems caused by having to change primary keys.

Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • i marked this as the answer, just because you're the first one who gave me another point to reason on. thx mark – frabiacca Jul 04 '10 at 12:46
  • 3
    it's an old one but where did you find the "should not change over time" clause? there is nothing, as far as i know, that says it should be constant. – Asken Mar 13 '13 at 08:57
  • 1
    @Asken You're right, but there will potentially be a lot of records (in other tables) which reference that primary key. To change a primary key you also need to change every reference to it. – William Aug 09 '14 at 15:16
  • Excellent advice. If a value has some external meaning, it may need to change eventually, making it an undesirable candidate for primary key. – Timo Sep 05 '16 at 14:43
  • Then again, if it has an external meaning, no need to look it up via natural attribute and have an extra index on that attribute if you just use it as a primary key. With regards of the reservations against changing natural keys - often it suffices to assume a business logic where a change in the natural key means it is an entirely different object, which is often also... natural. – Frank Hopkins Aug 28 '19 at 14:39
51

VARCHAR vs. INT doesn't tell much. What matter is the access pattern.

On absolute terms, a wider key will always be worse than a narrow key. The type carries absolutely no importance, is the width that matters. When compared with INT though, few types can beat INT in narrowness, so INT usually wins that argument just by the fact that is only 4 bytes wide.

But what really matters is the choice of clustered key. Often confused with the primary key, the two represent different notions and are not required to overlap. Here is a more detailed discussion Should I design a table with a primary key of varchar or int? The choice of the clustered key is just about the most important decision in table design, and a mechanical application of an INT identity(1,1) on it may be just the biggest mistake one can make. Here is where the question of access patterns comes in:

  • what are the most frequent interrogations on the table?
    • what columns are projected?
    • what predicates are applied?
    • what ranges are searched?
    • what joins are performed?
    • what aggregations occur?
  • how is the data inserted into the table?
  • how is the data updated in the table?
  • how is old data purged from the table, if ever?
  • how many non-clustered indexes exist?
    • how often are columns included in the NC indexes (key or leaf) are updated?

Overall, there are many access patterns that can be ruined by using an INT IDENTITY clustered key. So before jumping to apply a cookie cutter solution, perhaps a little bit of analysis is required...

Some more general guidelines:

You see there are no Primary Key design guidelines, because the Primary key is not an issue of storage design but an issue of modeling and is entirely domain driven.

Balder
  • 8,623
  • 4
  • 39
  • 61
Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • 1
    i think that developer i work with, created a varchar primary key basically on what you wrote down: "the Primary key is not an issue of storage design but an issue of modeling and is entirely domain driven" i mean: i agree if an entity identifier is a string (then, it depends on model design though), but i don't agree if you model a domain and then migrate it on the db. I'd prefer to have an entity id as string on my model, an int primary key and a unique varchar key (corresponding to the entity id) on my db – frabiacca Jul 04 '10 at 12:44
29

i was a bit disappointed because i've the habit to create an integer primary key (following what some teacher told me at the university). I've read a lot of documentation on the performance boost using integer primary key.

There is a term for this: confirmation bias:

"also called confirmatory bias or myside bias) is a tendency for people to favor information that confirms their preconceptions or hypotheses, independently of whether they are true. This results in people selectively collecting new evidence, interpreting evidence in a biased way, or selectively recalling information from memory."

Of course, your first reaction will be to say, "But that's not true!" Yeah, you would say that 'cos you're biased ;) [tongue firmly embedded in cheek]

Here's a classic example: say you had been told by your zoology professor that all swans are white and, sure enough, all swans you and your friends have ever encountered are white. Now let's say later in life a colleague expressed the opinion that perhaps there is such creature as a black swan. What?! That's not what your were taught. Your world is rocked! You immediately go out and conduct a swan survey and you count 1,000 white swans and zero black swans. Proof! If you'd found 10,000 white swans then the hypothesis 'All swans are white' would be ten times truer, right?

A different approach would be to forget about white swans for the moment and try to seek out a black swan. Perhaps take a holiday by the sea in sunny Dawlish?

I really don't mean to sound disrespectful; you admit to reading a lot about what you have been told and that indeed earns my respect. So here's a challenge: try to find cases where adding an integer column to a table is unnecessary.

Here are some hints and spoilers: tables that are not referenced by other tables; single column 'all key' lookup tables; 'small' tables that aren't queried much :)

Here are some other related topics you may like to investigate:

Does the word 'primary' in 'primary key' have much meaning or are all keys in a given table equal?

What are the qualities of a 'good' key? (e.g. should a key's values be immutable or is a stability 'good' enough?)

Is an integer column added to the table as an artifical key (perhpas because the available natural key is not 'good' enough) or as a surrogate key (perhaps to boost performance of an otherwise 'good' natural key)?

When a surrogate key is added to a table on performance grounds, is this for actual measured effect or merely for perceived effect (i.e. premature optimization)?

Should surrogate keys appear in the logical business model or are they for implementation only?

Is it a good idea to always do something (e.g. add an integer column to a table) without engaging the brain each time? ;)

[Disclaimer: I'm a natural key advocate and shun surrogates. For me they are like denormalization: you only do it when you have to, usually for a performance issue (specific and demonstrable), where the fault lies elsewhere (lousy SQL product version, logical design flaw that cannot be fixed at this time, etc). Surrogates should never appear in the logical business model. I sometimes need an artificial identifier and have even exposed them logical business models.]

onedaywhen
  • 55,269
  • 12
  • 100
  • 138
  • 3
    +1 for appropriate referencing Confirmation bias :) – Googlebot Oct 25 '11 at 02:14
  • 1
    This is a well written, thought provoking answer. It has succeeded in making me think twice about my nearly automatic application of artificial keys in sql tables. Well done! – Ty H. Jul 06 '15 at 16:43
  • 4
    Another example of confirmation bias in computing: 40-day password expiration. The story goes that back in the day of the VAX computer, someone figured out that it would take the computer a little over 40 days to brute force decode a password, so they set the rule that users had to change theirs every 40 days, and it stuck. – A.Grandt Jul 28 '16 at 07:44
  • I would like to know more about 40-day password expiration and similar things. Suggest a Reddit or something. Sorry for commenting not relevant comment. – Naveen Niraula Aug 20 '18 at 11:11