Adding an artificial primary key versus using a unique field

Question

Recently I Inherited a huge app from somebody who left the company.

This app used a SQL server DB .

Now the developer always defines an int base primary key on tables. for example even if Users table has a unique UserName field , he always added an integer identity primary key.

This is done for every table no matter if other fields could be unique and define primary key.

Do you see any benefits whatsoever on this? using UserName as primary key vs adding UserID(identify column) and set that as primary key?

I personally do not see any benefits. Only thing I can think of is maybe search based on an int is faster? but im not an expert in Db optimization. Even if this is true unless table is gigantic different is negligible. — , Jul 28 '22 at 17:16
There is an obvious performance benefit to having a key that fits inside a processor register, for both index maintenance and comparisons. That plus the ability to use an `IDENTITY` for such columns (which minimizes index fragmentation) means it's often worthwhile to do this, even if you're going to be adding `UNIQUE` constraints on the "real" keys anyway for consistency (since those separate indexes are easier to maintain than what is presumably the clustered index). There's also a big contingent of purists who think surrogate keys are the devil's toys, though. :P — Jeroen Mostert, Jul 28 '22 at 17:16
@jeroen : The question is if this is needed for a small table or average size table. — , Jul 28 '22 at 17:19
@BTFman: It is never *needed*, but it is often *beneficial*. Even if the table itself is small, there may be large tables referring to the smaller tables through foreign keys. Repeating the domain key values there can add up (although recent innovations like columnstores may make it less important again). If "JohnJacobJingleheimerSchmidtsApplication" logs into your site 100 times per second, it can certainly make a difference if you store that in the "Requests" table, or "1". — Jeroen Mostert, Jul 28 '22 at 17:22
You have to balance what @JeroenMostert is saying with the fact that if you use a natural key then the foreign-key child tables don't need joining just to retrieve it (because the foreign key columns are the same as the parent's). There is also the consideration of querying: one doesn't normally query by the surrogate key, so clustering on, say, a date is more common. Also the natural key may be quite small: for example countries or states can use 2-letter codes, so if space usage will be *less*. Or to put it in typical DBA lingo: "it depends". There is no one answer that works everywhere. — Charlieface, Jul 28 '22 at 22:44

allmhuran · Answer 1 · 2022-07-28T21:52:25.290

I feel like I have to add add another element to my comments, which started to produce an essay of comments, so I think it is better that I post it all as an answer instead.

Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.

But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.

One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.

But a natural primary key can eliminate the need for a join in the first place.

Let's put all this together with an example:


create table Countries
(
   countryCode char(2) not null primary key clustered,
   countryName varchar(64) not null
);

insert Countries values
('AU', 'Australia'),
('FR', 'France');

create table TourLocations
(
   tourLocationName varchar(64) not null,
   tourLocationId int identity(1,1) unique clustered,
   countryCode char(2) not null foreign key references Countries(countryCode),
   primary key (countryCode, tourLocationName)
);

insert TourLocations (TourLocationName, countryCode) values
('Bondi Beach', 'AU'),
('Eiffel Tower', 'FR')

I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.

On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.

But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.

By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.

OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.

But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.

Suppose I gave the results of this query to my users:

select   countryCode, tourLocationName
from     TourLocations
order by 1, 2;

Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.

When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.

If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.

You make a lot of nice points, although I would point out that the "my key is actually *smaller* than an `INT`!" argument is easily undermined by pointing out that we can use a `SMALLINT`. For very small dimension tables with no easy "natural" key you can even use a `TINYINT`, although that would be inappropriate for countries (if you had some wag claiming that "there will never be more than 256 countries" and "we can use the negative values too", you could safely ignore them). Of course that doesn't mean using a country code is not appropriate here, on the contrary. — Jeroen Mostert, Jul 28 '22 at 18:27
@JeroenMostert Yes, certainly we could use a narrower `identity`. The point I wanted to make with that paragraph was only that whether or not an identity column is needed in order to have a narrow key is contingent, so don't just add them unthinkingly. — allmhuran, Jul 28 '22 at 18:36

Olivier Jacot-Descombes · Answer 2 · 2022-07-30T15:27:43.013

1

A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.

Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.

By meaningless I mean that apart from being the primary key it has no meaning to the users.

An int identity column has other advantages over a text field as primary key.

It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
it is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).

Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.

But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.

edited Jul 30 '22 at 15:27

answered Jul 28 '22 at 17:24

Olivier Jacot-Descombes

104,806
13
138
188

Makes Sense but what if Table record can only be uniquely defined by 2 fields. – S Nash Jul 28 '22 at 17:26
You can always declare a multi column primary key. This often the case in junction tables of a m : n relation where the two foreign keys are a primary key at the same time. – Olivier Jacot-Descombes Jul 28 '22 at 17:32
If a meaningless key is the only thing that must be unique in your table, how do your users distinguish between two rows that are identical for all values other than the meaningless key? To put that another way, if the meaningless key "identifies records", *but it's meaningless to users*, how do users identify the record they want when two records are identical other than on the meaningless key? – allmhuran Jul 28 '22 at 19:12
@allmhuran. It is not forbidden to display the primary key to users. Often I display it in a small greyed out box. This is also helpful for debugging. – Olivier Jacot-Descombes Jul 29 '22 at 11:31
@OlivierJacot-Descombes Suppose we have a `Customers{id primary key, firstName, lastName}`. We make the `id` visible in the UI. Then we insert `{1, John, Doe}` and `{2, John, Doe}`. One day the phone rings. They say "Hi, this is your customer, John Doe". Which John Doe is on the phone? – allmhuran Jul 29 '22 at 12:31
It is meaningful in that it helps identify an entry, but the number per se has no meaning. It is not the birth year or the age or zip code or the shoe size. It is just a number. It could be any other number and this would not change its meaning as identifier. Therefore there will be no need to edit this number in future. – Olivier Jacot-Descombes Jul 29 '22 at 12:39
@OlivierJacot-Descombes Indeed, but what about my scenario with John Doe? (I decided to reword the comment so you may have missed the change). – allmhuran Jul 29 '22 at 12:41
This is a perfect valid scenario where the id helps differentiating between the two John Does. But the numbers 1 and 2 do not have another meaning in this context. It is not a classification or prioritization. It is neutral. No one will ever say “John Doe 1 should be John Doe 7”. – Olivier Jacot-Descombes Jul 29 '22 at 12:49
@OlivierJacot-Descombes You're not seeing the thrust of my question. I am asking how the application user can determine whether the person on the phone is John Doe with id 1, or John Doe with id 2. Suppose the user says "I have two John Does in the system, which one are you?". The caller says "How should I know?". The user says "one is id 1 and the other is id 2". The caller replies "What are you talking about?". – allmhuran Jul 29 '22 at 16:57
Well, this no more a database problem. This is a domain problem. You need an another field, like the e-mail address. You would then add a unique index or unique constraint to this field but still keep the numeric primary key. This allows you (or the user) to change the e-mail easily. – Olivier Jacot-Descombes Jul 30 '22 at 11:54
Right... so we don't actually use the surrogate key to identify records, we need something from the domain to be able to do that. A natural key. You can certainly have a unique autogenerated value *in addition* to that, for clustering order. – allmhuran Jul 30 '22 at 14:53
Yes, but the standard, technical way of identifying a record is still the numeric id. I.e., if you want to delete a specific record you would write `DELETE FROM Users WHERE Id = 7`. And related records are linked by this id. To get a user comments you write `SELECT * FROM Comments WHERE UserId = 5` where `Comments.UserId` is linked to `Users.Id`. It's about identifying records (from a database point of view) versus identifying users (from a user interface point of view). – Olivier Jacot-Descombes Jul 30 '22 at 15:18

score 1 · Answer 3 · answered Jul 28 '22 at 17:27

1

Yes having a dedicated int is a good thing for PK use.

you may have multiple alternate keys, that's ok too.

two great reasons for it:

it is performant
it protects against key mutation ( editing a name etc. )

answered Jul 28 '22 at 17:27

Randy

16,480
1
37
55

The Impaler · Accepted Answer · 2022-07-28T17:59:45.120

You are talking about the difference between synthetic and natural keys.

In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.

In database design:

Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as a unique identifiers for the rows.

I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".

score 0 · Answer 5 · answered Jul 28 '22 at 17:51

This subject - quite much like gulivar travels and wars being fought over which end of the egg you supposed to crack open to eat.

However, using the SAME "id" name for all tables, and autonumber? Yes, it is LONG establihsed choice.

There are of course MANY different views on this subject, and many advantages and disavantages.

Regardless of which choice one perfers (or even needs), this is a long established concept in our industry. In fact SharePoint tables use "ID" and autonumber by defualt. So does ms-access, and there probably more that do this.

The simple concpet?

You can build your tables with the PK and child tables with forighen keys.

At that point you setup your relationships between the tables.

Now, you might decide to add say some invoice number or whatever. Rules might mean that such invoice number is not duplicated.

But, WHY do we care of you have some "user" name, or some "invoice" number or whatever. Why should that fact effect your relational database model?

You mean I don't have a user name, or don't have a invoice number, and the whole database and relatonships don't work anymore? We don't care!!!!

The concept of data, even required fields, or even a column having to be unique ?

That has ZERO to do with a working relational data model.

And maybe you decide that invoice number is not generated until say sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO do to do with the fact that you designed a working relational data model based on so called "surrogate" or sometime called synthetic keys.

So, once you build that data model - even with JUST the PK "id" and FK (forighen keys), you are NOW free to start adding columns and define what type of data you going to put in each table. but, what you shove into each table has ZERO to do with that working related data model. They are to be thought as seperate concpets.

So, if you have a user name - add that column to the table. If you don't want users name, remove the column. As such data you store in the table has ZERO to do with the automatic PK ID you using - it not really any different then say what area of memory the computer going to allocate to load that data. Basic data operations of the system is has nothing to do with having build database with relationships that simple exist. And the data columns you add after having built those relationships is up to you - but will not, and should not effect the operation of the database and relationships you built and setup. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships as opposed to data column you add to such tables to store user data.

I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.

Thus yes, all tables have that pk "ID". Even better? in code, you NEVER have to guess what the PK id is - it always the same!!!

So, data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK? It could be a web service call to some monkeys living in a far away jungle eating banana's and they give you a PK value based on how many bananas they eaten. We just really don't' care about that number - it is just internal house keeping numbers - one that we don't see or even care about in most code. And thus the number one rule to such auto matic PK values?

You NEVER give that auto PK number any meaning from a user and applcation point of view.

In summary:

Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it not only the default, but is in fact required for such systems to operate.

score 0 · Answer 6 · answered Jul 28 '22 at 17:56

Its better to use userid. User table is referenced by many other tables. The referenced table would contain the primary key of the user table as foreign key.

Its better to use userid since its integer value,

it takes less space than string values of username and
the searches by the database engine would be faster

user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

Adding an artificial primary key versus using a unique field

6 Answers6

Linked