8

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.

What are the criteria?

bbadour
  • 616
  • 1
  • 4
  • 19
  • 8
    If this is homework, please inform us and explain how you are approaching the problem and what, exactly, you are stuck on. – Randolpho Sep 03 '10 at 02:39
  • It's not exactly homework--at least not for me. – bbadour Sep 03 '10 at 02:50
  • 1
    What is ambiguous about it? Are you unclear on "choosing", "designing", "databases", "tasks", "primary keys" or "criteria" ? What part needs clarification to eliminate the ambiguity? – bbadour Sep 03 '10 at 03:59
  • 1
    I see you are self-taught. I guess if your teacher doesn't understand what it means to choose, we have no reason to expect his student to either. – bbadour Sep 03 '10 at 05:05
  • 1
    I have absolutely been nice in everything I wrote. Is it "not nice" to ask for clarification with questions like "So you are saying there are no criteria at all. Is that correct?" The original question is absolutely not rhetorical. It is a very important question with a very simple answer that everyone designing databases needs to be able to enumerate as readily as the alphabet. – bbadour Sep 03 '10 at 05:24
  • 6
    Stability, familiarity and size are three important design criteria for keys. I disagree with the suggestion that this question be closed. The question is clear and straightforward. – nvogel Sep 03 '10 at 05:36
  • @dportas: I would choose "simplicity" rather than "size". Size is a physical attribute while keys are logical constructs. One can physically represent a compound key with 5 boolean attributes in less than a byte, but such a key is not simple. One can physically represent something like a book or a video with a pointer to its image somewhere, but that won't make those types any simpler. You omitted the 2 most important criteria, but I assume you take those as givens. If you think about it, can you see what those might be? – bbadour Sep 03 '10 at 13:40
  • 2
    @bbadour - I voted to reopen but it seems like you have quite a clear idea of what you think the answer is already. Why not give it as an answer? – Martin Smith Sep 04 '10 at 17:31
  • I prefer to reward someone else for doing so, and at the moment, the question is closed in any case. – bbadour Sep 04 '10 at 18:02
  • 1
    @bbadour: are you downvoting every answer that doesn't meet your expectations? If so, you're using downvotes in a non-standard way, and not endearing yourself to the community. I sense you don't care much about that, but that's going to reduce the help you get from SO folks – Michael Petrotta Sep 04 '10 at 19:14
  • 3
    I downvoted exactly 2 answers (1 to this question and 1 to another) contrasted with the 46 upvotes I gave. I downvoted both because they were not only poor but egregiously misinforming. In one case, the worst parts were fixed and I removed my downvote -- even though, the answer remains a poor one. *Is that not the standard way to use downvotes?* Thus far, it looks like SO folks have much more need of my help than vice versa. It first looked like I might entertain myself and give a little back as it were, but appearances deceive. I seem to have exhausted the supply of interesting questions. – bbadour Sep 04 '10 at 21:45
  • @bbadour: Agreed, "simplicity" is what I should have said rather than "size". I did take uniqueness and minimality as given. – nvogel Sep 06 '10 at 21:26
  • 2
    @dportas: I am glad to see you came back. I was hoping you might write up an answer to the question seeing as you were only able to comment before when it was closed. Since your comment was the 1st contribution to actually provide a correct list (even if it omitted the zeroth items,) you certainly deserve at least a shot at the bounty. – bbadour Sep 07 '10 at 02:26
  • 6
    Why was this question reopened? To me it's a clear case of someone wanting attention and trying to get it by putting up a huge bounty on a question that has no one "correct" answer, and which seems to be designed only to allow him/her to show off how smart they are (or think they are)? – Phil Sandler Sep 08 '10 at 13:48
  • "It's not exactly homework--at least not for me." Who is it homework for, then? The internet? – Phil Sandler Sep 08 '10 at 13:49
  • @Phil: When someone else comes along to ask a homework question, if the answers here answer the question, the question will be that person's homework question. – bbadour Sep 09 '10 at 00:11
  • 2
    @mickeyf: But who will pay the price of your fool's bargain? – bbadour Sep 09 '10 at 00:11

12 Answers12

18

The criteria for consideration of a primary key are:

  • Uniqueness
  • Irreducibility (no subset of the key uniquely identifies a row in the table)
  • Simplicity (so that relational representation & manipulation can be simpler)
  • Stability (should not be altered frequently)
  • Familiarity (meaningful to the user)
BradC
  • 39,306
  • 13
  • 73
  • 89
  • I was going to write up a more complete answer, but some things came up keeping me from doing so. In the end, I decided to award the bounty to Brad for persistence, clarity and brevity. – bbadour Sep 09 '10 at 04:50
17

What is a Primary Key?

The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.

Ability to Change

Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, providing someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define CASCADE ON DELETE and CASCADE ON UPDATE--to sync referential dependencies without having to disable constraints.

Natural or Artifical/Surrogate?

Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states is a good natural key because the abbreviation is consistent and everyone knows them:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
AL                    Alabama
AK                    Alaska
AZ                    Arizona
AR                    Arkansas
CA                    California

Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.

Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
100                   Alabama
101                   Alaska
102                   Arizona
103                   Arkansas
104                   California

There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.

Single Column vs Composite

Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).

Performance

From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
  • If you have a primary key in a relation with other data, doesn't the relation relate the primary key to the data? – bbadour Sep 03 '10 at 02:58
  • 3
    @bbadour: That's a foreign key, if I understand you correctly. A foreign key can be a primary key, and generally is in many-to-many tables. – OMG Ponies Sep 03 '10 at 03:00
  • @bbadour: Sorry, but having re-read your comment I see that your question is addressed by where I distinguish between what a natural and artificial key is. An id value of 1 on it's own does not relate back & forth between the record it is the primary key for. – OMG Ponies Sep 03 '10 at 03:37
  • If that is supposed to address the question, I can't make any sense of what you wrote at all. – bbadour Sep 03 '10 at 03:51
  • @bbadour: Can you be more specific about what you don't understand? – OMG Ponies Sep 03 '10 at 03:58
  • I don't understand how you think you communicated anything sensible about how primary keys in relations relate to data. – bbadour Sep 03 '10 at 04:02
  • 4
    @bbadour: That's the first lesson with primary keys - do you have a naturally occurring value that uniquely identifies a record? If the answer is no, you need to use an artificial value (integer, GUID). – OMG Ponies Sep 03 '10 at 04:07
  • 1
    Now you seem to be saying that uniqueness and identification are design criteria. Is that correct? – bbadour Sep 03 '10 at 04:31
  • @bbadour: I don't think "identification" is the best choice to describe natural/artificial keys, but yes - both are design criteria. – OMG Ponies Sep 03 '10 at 04:35
  • 1
    If a key doesn't identify, what does it do? – bbadour Sep 03 '10 at 04:52
  • 1
    @bbadour: "identification" doesn't explain natural/artificial values for a key. Both are ultimately used to identify a record in a table, but as I've said numerous times now - an artificial key value, isolated from any of the data in the table, has no relational/identification significance. A value of "1" tells you nothing about the record it is attached to. – OMG Ponies Sep 03 '10 at 05:00
  • Sure it does if it is attached to it. – bbadour Sep 03 '10 at 05:09
  • To clarify, it says: `This is the one with 1 attached to it.` – bbadour Sep 03 '10 at 05:16
  • 3
    If you're going to use natural keys, be very careful that you pick one that is static. Phone numbers may be unique, but they can also be changed. Similarly, a company may change their employee ID format. Even SSNs can be changed or duplicated (in rare circumstances). State abbreviations could be a reasonable primary key as long as you're using it only as a primary key, not for displaying, since capitalization conventions can change,. – Lèse majesté Sep 04 '10 at 08:49
  • 1
    @Lèse majesté: Familiarity and stability often conflict. Any of the criteria can conflict. The art of design involves finding an effective trade-off among conflicting criteria. Different designers will make different trade-offs. Some better than others. – bbadour Sep 04 '10 at 13:37
  • 1
    @bbadour: Consider a table containing the US states, and compare using the abbreviation as the primary key vs an integer starting at one. In context, knowing only the abbreviation & that the table is for US States is enough for a layman to know which US State is referenced in a foreign key. That's the case if the value is an integer--one might assume that the numbering is alphabetic, but it's not a guarantee. The integer value is completely arbitrary - some systems like to start the value @ 100/etc for security purposes (data obfusication). – OMG Ponies Sep 05 '10 at 18:01
  • Consider 2 equivalent tuples: {3296013,840082470,Bob Badour,PE,CA} or {840082470,Bob Badour,745} Which is more useful and informative? – bbadour Sep 05 '10 at 21:35
  • @bbadour: Modeling a person is a notorious poor example--first, middle and surnames can match but actually reflect different people. SSN/SIN are not reliable, because they can change (immigration, legally dead). A name doesn't guarantee gender, and gender can be changed. The situation requires using an artificial key for these numerous facts--there is no natural key with regard to people. DNA hasn't been able to distinguish between twins... – OMG Ponies Sep 05 '10 at 22:16
  • 1
    @OMG Ponies: None of your comments has any relevance to my example. My example does not use name as a candidate key for either tuple. Yet, even without attribute names or domain names, you recognized the domain, which speaks to the value of familiarity. In any case, I see I made a mistake in my example, I had intended {3296013,840082470,Bob Badour,745) versus {840082470,Bob Badour,PE,CA} – bbadour Sep 06 '10 at 00:49
  • 1
    @bbadour: That's because your example has nothing to do with the question - "usefulness" is meaningless to primary key criteria. "Informative" I reach to construe as being a natural key, which is a common modeling convention I've yet to see you accept. Your statement reinforces the use of an artificial key. I don't understand what you fail to grasp, and why you are so combative. – OMG Ponies Sep 06 '10 at 01:11
  • 2
    I don't know why you assume I am combative or why you assume I fail to grasp something. Perhaps you project your own flaws onto me? In my corrected example, the 1st tuple follows the rule of always using a surrogate key, and the 2nd tuple makes effective design tradeoffs. Can you tell what 745 means by looking at it? What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada. A good design is correct, elegant and useful. I contend the 1st example tuple fails on 2 counts. – bbadour Sep 06 '10 at 01:38
  • 1
    @bbadour: If you grasp the issue, you'd have marked the answer. When I asked for a specific thing you had issue with, you evaded the question by attacking my answer. When I re-iterated my original example, you ignored it while supplying data of your own design asking "which is more useful and informative?". Now you require design to be "elegant" and "useful"... What that has to do with primary key criteria, I don't know. All I get from your latest response is that you comprehend how arbitrary a numeric value as a primary key is. – OMG Ponies Sep 06 '10 at 01:56
  • 2
    If you had given a clear, sensible, correct and useful answer, I would have marked the answer. You have not. If I cannot extract any sensible meaning from what you write, I cannot sensibly respond. I suggest that those, who think they extracted sensible meaning from what you wrote, merely projected their own assumptions onto it to fill in the gaps. You call that attacking your answer, and I call that honestly admitting my own limitations. – bbadour Sep 06 '10 at 14:16
  • @bbadour: The answer is being voted up consistently, which implies it makes sense to others. I've tried asking for specifics (what you call "gaps") that you don't understand, and you reply--but with non-sequitors. You consistently evade the actual subject, while admonishing my answer & those who vote in agreement for it. That is what being "combative" is--it's not admitting your limitations, because you aren't addressing any. – OMG Ponies Sep 06 '10 at 17:39
  • 2
    I can only suggest you look for the flaws in your answer instead of flaws in me. Your resources would be better allocated that way by any number of measures. – bbadour Sep 06 '10 at 20:51
  • @bbadour: I'll address flaws when you actually describe them. You are the target audience – OMG Ponies Sep 06 '10 at 23:39
  • In my experience, natural keys are almost always the source of all drama in database design. About the only use they have is avoiding joins. The rest of the time you're wasting fixing issue such as when your natural key isn't as unique as you thought, or dealing with cascading updates that become burdensome. –  Sep 07 '10 at 11:36
  • 1
    @bbadour: sorry, but I'm with OMG Ponies here - from you comments I can't figure out what you are objecting to. His answer could be unclear, but can you concede your wording could be too as well? Since I consider this answer as pretty good and complete, I'm really interested in what I could miss. So could you try to put your arguments in other words, please (one flaw at a time)? – Alexander Malakhov Sep 08 '10 at 06:09
  • @bbadour, regarding your points: **1.** "Can you tell what 745 means by looking at it" - no, you just do *JOIN* and show what user needs, no problems here. **2.** "What does 3296013 achieve that was not already achieved by 840082470" - if you're sure 840082470 will not change till the end of the world, use it as PK. He clearly stated this in USA states example, right ? On the other hand, why do you have to guess whether social number or whatever will change in future or not ? IMO memory required for one integer column absolutely worth 100% guarantee, that your references will not be broken – Alexander Malakhov Sep 08 '10 at 06:45
  • 2
    @Alexander, you assume 1) the person knows how to join, 2) the person has authority to query the referenced table, 3) the person knows how to find the referenced table, 4) the data is still in the dbms where one can perform a join, 5) no cost accrues from discovering a join is required. If any of your assumptions is false, everything you predicate on them is pointless. – bbadour Sep 09 '10 at 00:17
  • @bbadour: The person knowing how to join has nothing to do with table design & primary key selection, nor does data being in the database or someone having to learn what the keys are. Column naming should be as informational as possible, but some databases (like Oracle, limiting to 32 characters) can make things difficult. You can't address table specific privilege(s) until *after* the tables are designed--until then you're relying on roles (assuming available). I'm not convinced whatever goal you have in mind is realistic, or if the goal is answering this question. – OMG Ponies Sep 09 '10 at 00:46
  • 1
    @Will - I disagree. Data migrations and consolidations are significantly easier with natural keys. Keep in mind that the use of a surrogate key does not obviate the need for another unique key on the table. Users must have means of differentiating one row from another by something other than the surrogate key. Think about the headache of consolidating two tables of States where the surrogate key for each State between the two systems are different. E.g. CA = 1 in system A and 99 in system B. – Thomas Sep 09 '10 at 01:45
  • 1
    @bbadour - Regarding your last post, am probably not the only one that has no idea where you are going with your comment. If you use a surrogate key, the system designer and the application designer must know to use them and the data must exist (along with the database, computer systems, planet earth...). So? – Thomas Sep 09 '10 at 01:47
  • @OMG Ponies - I'm not sure I've ever heard the term "artificial" key. Rather, I read that these were called surrogate keys. However, I'll grant you that "artificial" seems more diametric than "surrogate". – Thomas Sep 09 '10 at 01:49
  • @bbadour: **1)**, **3)** Well, if your developers can't figure it out or you can't explain it to them in 10 minutes, you are in biiig trouble :). **2)** It makes no sense to give access to main table and not to give to lookup table. **4)** How can it be elsewhere ? If you meant data corruption, this doesn't relate to design or even implementation. **5)** Do you mean performance issues ? If so, then yes - sometimes you need to de-normalize tables, but no one said you have to be zealous. And that is rare case anyway. So, have I convinced you ? – Alexander Malakhov Sep 09 '10 at 01:49
  • @Thomas: Very good point - "surrogate" is the more common terminology, but "artificial" from an explanation standpoint makes *way* more sense. I'm guilty of fighting terminology, like how people call derived tables/inline views *and* subselects "subqueries". My redemption lies with things making sense rather than buzzwords ...I think/hope =) – OMG Ponies Sep 09 '10 at 02:15
  • 1
    @Alexander and @Thomas: You seem to have some perverse notion that good database designers design only for a select group of expert users who never pull data out into a spreadsheet. What an absurd notion! – bbadour Sep 09 '10 at 12:36
  • 2
    @bbadour - There is no inherent means in a spreadsheet to enforce any of the elements of database theory. Why talk about the attributes of a good primary choice for people using a tool that cannot enforce the fundamental aspects of a primary key? Is it really that crazy that we'd assume that someone asking about elements of database design would be using a database management system of some sort? – Thomas Sep 09 '10 at 15:09
  • @bbadour: Sorry, I don't see what is your point. Do you mean "asking from developers to know what is *JOIN* " is too much, or that "users will have to deal with raw DB and they will be confused by surrogate keys" or what Thomas said ? – Alexander Malakhov Sep 09 '10 at 15:33
  • @bbadour: The database design is for optimally storing the data, not making it available to users--that's what UI is for. Expert spreadsheet users don't do hierarchical/recursive queries, and long for the speed of aggregate functions because they have to do that work *by hand*. – OMG Ponies Sep 09 '10 at 15:47
  • @Thomas I completely agree that, per your example, that a States table should use either the state abbreviation or the state name as its key. States don't ever change their name and are guaranteed to be unique. You picked one of the few instances where it is the perfect solution. One of the few, IMHO. –  Sep 09 '10 at 16:02
  • 1
    The five assumptions I enumerated are assumptions. During normal use of a database management system by various users, one or more of those assumptions will often be false. If someone cannot conceive of an assumption as an assumption or if someone cannot imagine a situation where an assumption could be false, I can offer no remedy. – bbadour Sep 11 '10 at 21:45
  • @OMG Ponies: You seem not to understand that the purpose of a database management system is to manage data for all users of the system, or that managing includes making data available to users of the system. – bbadour Sep 11 '10 at 21:52
  • After reading this entire comment section I am led to believe that @bbadour is being intentionally unhelpful and destructive. – Smeghead Sev Sep 22 '16 at 16:03
8

Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several table, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).

In an ERD entities have primary keys. These are natural keys, that is they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity if might be OrderRef For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023 ); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.

Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.

The characteristics of an entity's primary key are:

  • unique
  • familiar
  • stable (presumed)
  • minimal (one or more attributes but as few as necessary)

When it comes to primary keys for database tables, natural keys are not always suitable.

There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.

Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.

In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there which dependent tables inheriting a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is a just a hassle.

The characteristics of an entity's primary key are:

  • unique
  • appropriate (meaningless)
  • guaranteed stability
  • minimal, usually a single column (except for intersection tables)

So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.

The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.


Regarding familiarity

Familiarity is a curly one. It is an important consideration when it comes to we are identifying primary keys in a conceptual data model but it is less useful when it comes to database design.

In a commnet @bbadour provides two contrasting examples:

{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}  

and poses the question:

"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."

Well, 840082470 is like a invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.

As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.

First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.

Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.

Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.

Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.

Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.

In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.

[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.

APC
  • 144,005
  • 19
  • 170
  • 281
  • 1
    I understand how keys relate to logical data models. I don't see what they have to do with the physical at all. – bbadour Sep 03 '10 at 04:06
  • @bbadour - I have expanded my answer. My original response was a placeholder as your question had four close votes, and indeed attracted the fifth while I was editing. – APC Sep 03 '10 at 04:16
  • 1
    Fair enough; however, I still don't see what keys have to do with the physical at all. It seems you include "unique identification", "unvarying", "not compound". Not the words I would use, but certainly 3 design criteria. I am somewhat amused that you equate drawing ER diagrams with logical modeling. You seem to say "physicalizing" involves both primary and foreign keys. I don't see what either have to do with the physical. – bbadour Sep 03 '10 at 04:42
  • 1
    I am amused because ERM is a methodology for **conceptual analysis** as is NIAM, for example. ERD is a graphical language used for conceptual analysis, as is ORM, for example. None of that really has anything to do with **logical design**; although, tools for ERD and ORM often blur things by offering features to automate large parts of logical design. I downvoted your answer for confusing some of the most fundamental vocabulary in IT as reflected in the ISO/IEC 2382 standard vocabularies. – bbadour Sep 03 '10 at 23:39
  • 1
    You fixed some of the most egregious bits so I removed my downvote. Thank you! – bbadour Sep 04 '10 at 13:30
  • 3
    It's rare, but there have also been instances of duplicate SSN's (I suppose the social security administration fixes it at some point though) – wadesworld Sep 05 '10 at 12:49
  • 1
    Even without the IRS assigning duplicate SSN's, one must expect to encounter them due to illegal immigrants making up SSN's to file their taxes. Being undocumented means getting assigned a date for a deportation hearing and released. Tax evasion gets prison time. – bbadour Sep 06 '10 at 00:53
  • I'd upvote the answer if SSN weren't used as an example -- it's a terrible example of a key, because of fraudulent duplication, reassignments, mistaken duplication, and the fact that not everybody has one! – Dave Markle Sep 09 '10 at 22:18
  • @DaveMarkie - In my answer I say that that SSN is not suitable as a primary key, although I admit the reasons I give are not as exhaustive as your list. – APC Sep 10 '10 at 04:36
4

A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.

And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.

Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.

Randolpho
  • 55,384
  • 17
  • 145
  • 179
  • 1
    You seem to agree that no design criteria exist. Is that correct? – bbadour Sep 03 '10 at 02:53
  • 1
    @rockinthesixstring So you think uniqueness is the only criterion. Is that correct? – bbadour Sep 03 '10 at 03:32
  • 1
    If your answer is "The only criterion for choosing a good primary key is uniqueness", why didn't you say that in your answer instead of pasting in a definition from somewhere else? – bbadour Sep 03 '10 at 03:56
4

A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).

Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.

These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.

A primary key is a candidate key.

nvogel
  • 24,981
  • 1
  • 44
  • 82
3

Hey. it's open again. Here goes.

(1) Choose good candidate keys.

It does not pertain to the database designer to choose candidate keys. The database designer has the responsibility to see to it that all the uniqueness requirements he is informed of by the user, will be enforced. So it is the user who "chooses" what the candidate keys are.

There are two scenario's I can think of that relax this unequivocal position a bit.

One is if the user says that some attribute of type 'video' or 'audio' (or some such) is to be unique. It may be infeasible to actually enforce that, and it is the designer's responsibility to point that out to the user (as it is also his responsibility to point out that 'uniqueness' of audio and video content is a very debatable subject, and that any uniqueness on such attribute values, even if enforcible by the system, still has a good chance of not being the same uniqueness that the user wants).

Second is how the picture gets muddied by the possibility of distinct logical designs all addressing the same problem. If D1 and D2 are both valid designs addressing the same problem, then it might be the case that a certain given uniqueness rule imposed by the user, is enforcible using keys in D1, but not in D2. From this perspective, "choosing candidate keys" can be interpreted as "choosing a particular design such that a given uniqueness rule is enforcible using keys". But that wasn't really the question that you asked.

(2) Choose good primary keys.

A while ago, Darwen launched the question "What are good reasons to single out one particular candidate from among the others as being 'primary' ?". Nothing much came out, except then perhaps : "to suggest that this particular key is the preferred one to use whenever making references to this relvar". I suspect they didn't find that convincing enough to change their earlier decision that "no key is more unique than any other".

But, supposing that nonetheless there exists some valid reason to single out one particular key as "primary", I suppose the following considerations apply :

  • the likeliness, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design.
  • and as a consequence of that, the probability of having to change a value of some existing primary key. Key values that are highly stable will be preferable over key values that are more volatile.
  • the percentage of the business that naturally uses some such key in their daily operations.
  • if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Erwin Smout
  • 18,113
  • 4
  • 33
  • 52
  • I prefer to see an answer that more clearly separates the physical considerations. Keys are logical constructs. Insofar as some product limitations might force one to choose different keys to implement partitioning, clustering or indexing, I prefer an answer that makes clear those decisions relate specifically to working around product limitations--not good design. – bbadour Sep 06 '10 at 01:22
  • 1
    I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour Sep 06 '10 at 01:28
3

Your answer to Erwin: "I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "

Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys then simplicity, stability and familiarity are the criteria to choose from all the candidate keys in de referenced relation. There could be more candidate keys that fulfill that criteria to more or less the same extend. If we look at familiarity, one candidate key could be very familiar to a group of users and not to another group for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as foreign key). If you insist in 'primary keys' of which we only have one per relation then I have to ask what makes a key more primary than others. I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).

So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.

Do you agree with this? If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"? If not, why?

Regards, Carlos

  • 2
    Carlos, my original question was about database design and emphasized that fact. Further, I clearly tagged the question as being about foreign keys right from the start as one cannot discuss the topic of choosing good primary and candidate keys without involving foreign keys. Others have removed the emphasis and altered the tags. I merely clarified the role of foreign keys in my question when I put the tag back so that the next passerby won't remove it again. – bbadour Sep 06 '10 at 13:18
  • Both Erwin's answer and your answer are good enough to get a vote from me, but neither answers the question directly enough, clearly enough or elaborated enough to warrant the bounty. While your answer is the 1st to fully enumerate the design criteria, it does so as an aside while nitpicking the question instead of answering it. No, I won't change the question for 2 important reasons: 1) brevity and 2) clarity. – bbadour Sep 06 '10 at 13:28
  • 1
    @bbadour: My answer was also about 'logical' database design. And I think we should not talk about foreign keys referencing primary keys but candidate keys. I think I was clear about that. As an aside: candidate to what? to become primary? Lets drop the ‘candidate’ thing too: simply ‘keys’. I had to ask what makes a primary key more primary than others. I also explained that the criteria are different for determining if a set of attributes should be a key and for choosing a foreign key. Bottom line: talking about 'primary' keys is problematic. – Carlos M. Calvelo Sep 06 '10 at 14:12
  • @bbadour: Even if you don’t change your question, would you agree that my reformulation (trying to forget about ‘primary’ keys) of your question is less problematic? As an aside: It was not my intention to give an elaborated answer (I’m not doing homework here) or to get a bounty. If I understand and I’m being understood I’m happy. – Carlos M. Calvelo Sep 06 '10 at 14:13
  • No, I would not agree. A candidate key is a candidate for referencing. A primary key provides guidance on how best to reference a relation. I don't find my question problematic. I would hope that anyone motivated to answer a question would be motivated to provide their best answer regardless of incentive. ::shrug:: – bbadour Sep 06 '10 at 20:48
  • @bbadour: I was wrongly thinking the terms candidate and primary were being used in a more formal and restrictive sense, but they are only for pragmatic guidance in design. I understand now why you would not change your question. Thank you for the clarifications. – Carlos M. Calvelo Sep 06 '10 at 23:18
2

A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:

Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.

Irreducability: Removing any column from a candidate key must violate the uniqness property. In other words, no subset of columns in a candidate key is itself a candidate key.

If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.

It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criteria often applied is simplicity indicating the candidate key with the fewest columns. However there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria however, are strictlty outside the scope the relational model, often conflict with each other, and are often made to deal with implementation limitations.

I would say that the first two concepts "uniqueness" and "irreducability" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.

Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candididate keys, but are why we should choose a primary key at all. If there are are multiple candidate keys in a table, it simplifys things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.

2

What are the criteria?

A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.

  • You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).

    In this case, you rely on the outside world.

    • Do all people have SSN? (They don't).
    • Are SSN's unique? (They aren't).
    • Can an SSN be assigned to another person? (It can).
  • You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.

    In this case, you rely on yourself and your database skills.

    • Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
    • Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
    • Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).

    Or another set of answers:

    • Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
    • Are these ID's unique? (No, we failed to merge two versions of the database properly).
    • Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).

The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But by far not all things have some kind of identifier everybody agrees upon.

However, a good natural key cannot be overemphasized.

Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you always can restore right from the memory.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
1

Only one really, choose a surrogate for each table (identity/auto_number) or something similar that the users will never even see so you can do whatever is necessary with them whenever you need to now and in the future.

Tahbaza
  • 9,486
  • 2
  • 26
  • 39
  • So you are saying there are no criteria at all. Is that correct? – bbadour Sep 03 '10 at 02:52
  • 2
    It depends on whose criteria you refer to. Primary keys should be internal to a database and determined WITHOUT end user input in my opinion for performance and maintainability reasons. – Tahbaza Sep 03 '10 at 03:02
  • I thought I was clear when I used words like "choose", "good", "database design", "design task" and "design criteria". Was I not? – bbadour Sep 03 '10 at 03:35
  • Your reply does not answer the question. The use of a surrogate key does not obviate the need to choose other keys. – nvogel Sep 03 '10 at 05:41
1

(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)

At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.

For the purpose of our discussion here, its important to distinguish between the two:

A PRIMARY KEY is a field or combination of fields that uniquely identify a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server, not sure how other RDBS might handle this)

Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the same criteria for choosing a good clustering key are the same as for choosing a good primary key.

Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:

  1. Unique - otherwise useless as a row identifier for other indexes
  2. Narrow - this key is used in other indexes, so should be as narrow as possible
  3. Static - If key values change, then references become invalid and will need updating
  4. Ever-increasing - To reduce physical table fragmentation as new rows are added

It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.

A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)

BradC
  • 39,306
  • 13
  • 73
  • 89
  • 1
    Thank you for your answer Brad. My question is not product-specific. For your benefit, I have asked a new question [How does the primary key keyword relate to clustered indexes in MS SqlServer?](http://stackoverflow.com/questions/3662163/how-does-the-primary-key-keyword-related-to-clustered-indexes-in-ms-sqlserver) Here, I can give your answer a vote. There, I can pick it as the best answer. If you would be so kind as to copy and paste it there, you would serve this site very well. Thank you! – bbadour Sep 07 '10 at 19:53
-2

I'm going to say something here that is not expected.

All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.

The primary key is special when it comes to range queries, and for that reason if you have a dominant range query that is your primary key, no exceptions.

If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.

Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.

Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.

EDIT: definition of range query:
      A range query is a query that is an ORDER BY query or contains either a greater than or less than operator. What we are interested in are the columns for which these queries run on. The fundamental idea is a range query fetches several (tens to hundreds to perhaps thousands but not all) rows from the table based on bounding conditions at one or both ends.
      There is another kind of range queries, and that is where you have a foreign key to another table and an operation is select all matching on that foreign key. This is in fact also a range query although not obviously so.

Joshua
  • 40,822
  • 8
  • 72
  • 132
  • 3
    @Joshua: I cannot make any sense of your answer at all. Range queries don't seem to have anything at all to do with keys. Did you mean intervals? Intervals and the sorts of problems identified and solved in Date, Darwen and Lorentzos' book on Temporal Databases? – bbadour Sep 06 '10 at 01:53
  • Intervals are a type of range queries. – Joshua Sep 06 '10 at 20:27
  • 2
    @Joshua - You seem to be conflating "Primary Key" with "Clustered Index". In Microsoft SQL Server it is true that PKs will default to being the clustered index if you don't specify any different but they are entirely different concepts. – Martin Smith Sep 06 '10 at 20:46
  • 1
    @Joshua: I still cannot make much sense of your answer. It seems you are saying that whatever foreign key reference is most queried defines the primary key. However, during design, one can decide which of several candidate keys to propagate to another relation for the purposes of referencing. – bbadour Sep 06 '10 at 21:03
  • @Martin: no I am not. Just try to declare a primary key and clustered index that differ. – Joshua Sep 06 '10 at 21:04
  • @bbadour: you are aware that there is actually no requirement that every foreign key relationship use the same key of the potential candidate keys right? – Joshua Sep 06 '10 at 21:05
  • 1
    @Joshua - That is perfectly possible (I'm still assuming you are talking about SQL Server. Syntax Follows) `create table #X (col1 int not null primary key nonclustered,col2 int) create clustered index ix ON #X (col2)` – Martin Smith Sep 06 '10 at 21:17
  • 2
    @Joshua: you are misinformed. Keys and indexes are completely different concepts. Keys and *clustered* indexes even more so. I don't know of any DBMS that requires a primary key and a "clustered index" to refer to the same set of columns - certainly not SQL Server. However, the topic is keys generally, it's not a question about the limitations of some particular software so your answer is irrelevant. – nvogel Sep 06 '10 at 21:22
  • ah "primary key nonclustered" -- never heard of that before. – Joshua Sep 06 '10 at 22:45
  • 1
    Yes, Joshua, I am aware of that. I am not aware of any difference that makes to whether designers decide which candidate keys to use for different references. Now that you have heard of a nonclustered primary key, might you consider changing your answer or deleting it? – bbadour Sep 06 '10 at 23:41
  • 3
    Joshua: in my opinion, you underestimate the importance of primary keys in regular databases if you design them based on some "dominant query". – Stefan Steinegger Sep 07 '10 at 09:42
  • The day should come when I write my own book on databases. I have reason to believe the theory as taught in the classroom is actually wrong when it comes to practical applications. I appear to have underestimated the value of primary keys. I have simply transferred that value to candidate keys. – Joshua Sep 07 '10 at 15:12
  • 1
    @Joshua: A primary key is a candidate key. I don't know how you "underestimate" the value of them. Candidate keys are essential to getting correct results from relational database design. The topic of this question is design theory and that is not invalidated by the limitations of indexes in one or other versions of one particular DBMS. If the limitations of your DBMS software make some aspect of good design seem inconvenient to you then maybe you should use some better software rather than rewrite the textbooks to suit it. – nvogel Sep 08 '10 at 21:51
  • The difference between primary key and candidate key is there can be only one primary key. – Joshua Sep 08 '10 at 23:27
  • 1
    This answer (and your subsequent commentary) fundamentally confuse the logical and physical levels, and, despite alluding to having had an education covering databases, betray your complete ignorance of the literature around this topic. Your assertion that the (mathematically grounded) relational theory is wrong is not accompanied by any formal disproof. I highly recommend that you review your learning materials again and reconsider your position, if you have not already. –  Sep 11 '12 at 18:30
  • I do not confuse the logical and physical levels. The consequence of ignoring the physical level in early database design is far too great to accept. – Joshua Sep 11 '12 at 20:06