surrogate vs natural key: hard numbers on performance differences?

Question

There's a healthy debate out there between surrogate and natural keys:

My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.

Example of the two approaches, starting with a Company table:

1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.

2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.

Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.

The only convincing argument I've seen for natural key is for a many to many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of scope of this post I think.

Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.

Important Update: I've started building a set of test tables that answer this question. It looks like this:

PartNatural - parts table that uses the unique PartNumber as a PK
PartSurrogate - parts table that uses an ID (int, identity) as PK and has a unique index on the PartNumber
Plant - ID (int, identity) as PK
Engineer - ID (int, identity) as PK

Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.

It probably depends on the data type of the surrogate and natural keys, and how they're indexed, etc., etc. — John Saunders, Aug 04 '09 at 18:39
Given that people use surrogate keys because they are logically necessary, the performance differences between then and natural keys (should such differences exist) nust be immaterial - you can't replace one with the other. — , Aug 04 '09 at 18:42
I think natural keys usually involve varchar fields and surrogate keys are almost always ints. You're right though, but I hope that whatever mythical study/whitepaper I'm searching for will address this. — jcollum, Aug 04 '09 at 18:42
Natural keys involve whatever's natural. surrogate keys may be ints, bigints, GUID, etc. — John Saunders, Aug 04 '09 at 18:46
@Neil: I'm not sure I agree with the logically necessary part. From what I've read, many people use surrogate keys even when a natural key might be available, for a variety of reasons. — jcollum, Aug 04 '09 at 18:56
@jcollum Not to start that debate here, but surrogate keys should be used whether there is a natural key or not (and there almost always is). And note my use of the phrase "Given that..." — , Aug 04 '09 at 19:11
surrogate keys are NOT "logically necessary". how about a log table, the date and time (represented with sufficient precision) is a perfectly acceptable natural key. — Charles Bretana, Mar 12 '14 at 14:18

Charles Bretana · Accepted Answer · 2021-01-02T19:42:23.647

9

Use both! Natural Keys prevent database corruption (inconsistency might be a better word). When the "right" natural key, (to eliminate duplicate rows) would perform badly because of length, or number of columns involved, for performance purposes, a surrogate key can be added as well to be used as foreign keys in other tables instead of the natural key... But the natural key should remain as an alternate key or unique index to prevent data corruption and enforce database consistency...

Much of the hoohah (in the "debate" on this issue), may be due to what is a false assumption - that you have to use the Primary Key for joins and Foreign Keys in other tables. THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the Primary Key, an alternate Key, or any unique index or unique constraint., as long as it is unique in the target relation (table). And as for joins, you can use anything at all for a join condition, it doesn't even have to be a key, or an index, or even unique !! (although if it is not unique you will get multiple rows in the Cartesian product it creates). You can even create a join using non-specific criterion (like >, <, or "like" as the join condition.

Indeed, you can create a join using any valid SQL expression that evaluate to a boolean.

edited Jan 02 '21 at 19:42

answered Aug 04 '09 at 19:18

Charles Bretana

143,358
22
150
216

2

I think you are misusing the term "corruption" here (which implies random values being scribbled over the database data) - you really just mean "consistency". – Aug 04 '09 at 19:56
Perhaps, w/o picking at nits, I guess I (mis?)use the word corruption in a broader sense than I should... If there are two rows in a Transaction table that actually represent the same transaction, is that corrupt data or inconsistent data? And I have thought inconsistency specifically meant when some invariant was not satisifed, like, for e.g., if the totalAmount column in an Invoice row was not equal to the sum of the Invoice Line item row amounts. – Charles Bretana Aug 04 '09 at 21:22
"THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the Primary Key, an alternate Key, or any unique index or unique constraint." Yet Edgar Codd wrote in his 1970 seminal paper [*A Relational Model of Data for Large Shared Data Banks*](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf): "We shall call a domain (or domain combination) of relation *R* a *foreign key* if it is not the primary key of *R* but its elements are values of the primary key of some relation *S* (the possibility that *S* and *R* are identical is not excluded)." – Géry Ogam Jan 02 '21 at 18:17
@Maggyero An SQL FK references an SQL unique column list. A relational FK is generally understood to reference a CK. (PKs don't matter to relational theory, CKs do.) SQL does not faithfully use the relational model, terms mean different things in SQL than in the relational model, Codd's papers tend to be unclear, imprecise, flawed & superceded, and foundations of the relational model have become much better since 50 yrs ago. [Re SQL vs relational.](https://stackoverflow.com/a/46029100/3404097) – philipxy Jan 03 '21 at 03:27
@philipxy Interesting, in other words you think that Codd’s definition of a foreign key of a relation *R* which is any subset of a primary key of a relation *S* (any *proper* subset when *R* = *S*) should be relaxed to any subset of a relation candidate key of a relation *S* (any *proper* subset when *R* = *S*). – Géry Ogam Jan 03 '21 at 13:21
@philipxy Thanks for the link, I have posted a shorter and more formal [answer](https://stackoverflow.com/a/65550235/2326961) to complement yours. – Géry Ogam Jan 03 '21 at 13:30
1

@Maggyero What I said was how the words get used. And what the asker means depends on what they meant, and they don't say. And whatever you mean, you need to say. What matters is what terms are referring to, not how they are spelled. Also any particular thing Codd wrote in 1970 is not necessarily relevant. What matters is a modern understanding. Also from a relational point of view, all constraints should be declared & the notion of stopping at some variation like references to PKs, CKs or superkeys is misconceived. – philipxy Jan 03 '21 at 16:32
@philipxy I have just read Date’s 2015 [*Codd’s First Relational Papers: A Critical Analysis*](https://www.dcs.warwick.ac.uk/~hugh/TTM/CJD-on-EFC's-First-Two-Papers.pdf) that you linked in [another answer](https://stackoverflow.com/a/46029100/2326961) and you are right, Codd’s two restrictions on foreign keys seem obsolete: "Foreign keys were invented by Codd, but his definition of the concept changed several times over the years. The first definition, in the 1970 paper, was the one just quoted. – Géry Ogam Jan 03 '21 at 19:10
… That definition includes the strange and clearly unnecessary restriction that a foreign key not be the primary key of its containing relation (or relation variable, rather). Codd later and silently dropped that restriction, but he never dropped the restriction that the target of a foreign key had to be a primary key specifically. {Footnote: To its credit, SQL never abided by either of these restrictions.}" – Géry Ogam Jan 03 '21 at 19:10
@Maggyero As I said, it's not helpful to think of there being a "restriction" because all that is being determined is what a word denotes, not what should be done in DB design. There are reasons one might prefer to name or implement certain things or not. If key means CK then it's reasonable to say that SQL goes past FKs to (an SQL analogue of) foreign superkeys. And whatever it does or doesn't, so what? Extant SQL DBMSs stop there & don't implement arbitrary constraints. That's a restriction. PS Please put further comments on your linked answer so we don't bother the author of this answer. – philipxy Jan 03 '21 at 23:51

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

3

Natural keys differ from surrogate keys in value, not type.

Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.

However, most used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS does use for GUID's),

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.

Comparing VARCHARs make take collation into account and they are generally longer than integers, that making them less efficient.

Comparing a set of two INTEGER is probably also less efficient than comparing a single INTEGER.

On datatypes small in size this difference is probably percents of percents of the time required to fetch pages, traverse indexes, acquite database latches etc.

And here are the numbers (in MySQL):

CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);

INSERT
INTO    aint
SELECT  id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM    t_source;

INSERT
INTO    bint
SELECT  id, id
FROM    aint;

INSERT
INTO    adouble
SELECT  id, id, value
FROM    aint;

INSERT
INTO    bdouble
SELECT  id, id, id
FROM    aint;

SELECT  SUM(LENGTH(value))
FROM    bint b
JOIN    aint a
ON      a.id = b.aid;

SELECT  SUM(LENGTH(value))
FROM    bdouble b
JOIN    adouble a
ON      (a.id1, a.id2) = (b.aid1, b.aid2);

t_source is just a dummy table with 1,000,000 rows.

aint and adouble, bint and bdouble contain exactly same data, except that aint has an integer as a PRIMARY KEY, while adouble has a pair of two identical integers.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

Performance difference, if any, is within the fluctuations range.

edited Jun 20 '20 at 09:12

Community

1
1

answered Aug 04 '09 at 18:44

Quassnoi

413,100
91
616
614

The example that I see most often of a surrogate vs natural key is States (in the US). Surrogate key on that table would be an int, but the natural key would be a char(2). – jcollum Aug 04 '09 at 18:48
I wonder if SSN will change from an INT to a CHAR when the US goes past 1B people. It's not too far off really. – jcollum Aug 04 '09 at 18:53
oops, apparently the US's growth rate isn't even close to that of the world, we've got a way (in years) to go to hit 1B people. – jcollum Aug 04 '09 at 18:59
@Javier: that means "0.0001 or something aroung that" :) – Quassnoi Aug 04 '09 at 19:11
I certainly disagree that natural keys only differ in value, not type. A natural key could be a company's name by state, couldn't it? It would certainly be unique (assuming that company names are unique by state, which seems reasonable). So why not use that as a natural key? Or are you in favor of using natural keys only in cases where the natural key is some arbitrary level of un-complexity? Like a part number that's alphanumeric but only A-Z, 0-9? Surrogate keys are typically much less complex, usually an int, guid or bigint. – jcollum Aug 04 '09 at 20:24
@jcollum: Could you please give me the definition of the term `surrogate key` as you understand it for the purposes of this discussion. You may want to read this Wikipedia article: http://en.wikipedia.org/wiki/Surrogate_key This question is crucial for the further discussion so I'd really ask you to answer it before this discussion can continue. Thank you. – Quassnoi Aug 04 '09 at 21:11
@Quassnoi: I think my use of the term matches this: "the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database" (from the link you gave, which I'd already read) – jcollum Aug 04 '09 at 21:45
@jcollum: ok. So we have two tables that hold companies: first one uses a `VARCHAR(200)` column filled with `NEWID()` as a `PRIMARY KEY`. `NEWID` is certainly "generated by the `RDBMS`", and certainly "is not derived from any application data in the database". The second table uses a `VARCHAR(200)` column filled with company name as a `PRIMARY KEY`. Now we see two identical tables whose keys differ only in type, not value, but the first table uses surrogate key, while the second one uses natural key. Now, what is that you disagree with? – Quassnoi Aug 04 '09 at 22:13
Read above as "whose keys differ only in value, not type". – Quassnoi Aug 04 '09 at 22:15
Typically, in the uses of surrogate keys that I've seen, the surrogate key is an int, bigint or a guid. Your example uses a natural key (VARCHAR(200)) and a surrogate key (VARCHAR(200)). That seems artificial. Who would choose a varchar field for a surrogate key? Why 200 characters? The number of possible rows in that table would probably take up more storage than you can find on Earth (9 E 282 rows by my math). – jcollum Aug 04 '09 at 22:32
(yipes, I just assumed 26 ^ 200, the number would be a lot larger than that!!) – jcollum Aug 04 '09 at 22:41
1

@jcollum: when I asked you of the real world problems yesterday you answered me that it's not relevant since we are speaking about concepts, not real world problems. Now, when I give you a concept, you are starting to speak about real-world usage and change the subject of discussion. You didn't answer my question. Aren't the keys in the example above surrogate and natural, respectively? Don't they differ only by value, not by type? If they are and they do, what was that you disagree with? Please answer. Thank you. – Quassnoi Aug 04 '09 at 23:14
@Quassnoi: I was referring to not getting into specifics about business rules: how many Account Execs can be attached to a Company for instance. I think your example addresses natural keys in many-to-many tables. Which is fine, but not something I'm concerned about. Composite (i.e. int, int) keys in many-to-many tables seems like a reasonable thing to do. I'm more concerned about natural vs. surrogate keys in primary entity tables (Part, Company, Employee etc.). – jcollum Aug 10 '09 at 16:42
`@jcollum`: you are conserned about performance of `INT32` comparison or what? This will be the same. A database will compare `32-bit` integers filled with part numbers with exactly same efficiency as ones filled with identity. Performance of `CMP EAX` does not depend on the source of the number being compared. – Quassnoi Aug 10 '09 at 17:05

surrogate vs natural key: hard numbers on performance differences?

2 Answers2

Natural keys differ from surrogate keys in value, not type.

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

Linked

surrogate vs natural key: hard numbers on performance differences?

2 Answers2

Natural keys differ from surrogate keys in value, not type.

Comparing surrogate integers and natural integers (like SSN) takes exactly same time.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

Linked

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.