29

Traditionally I have always used an ID column in SQL (mostly mysql and postgresql).

However I am wondering if it is really necessary if the rest of the columns in each row make in unique. In my latest project I have the "ID" column set as my primary key, however I never call it or use it in any way, as the data in the row makes it unique and is much more useful for me.

So, if every row in a SQL table is unique, does it need a primary key ID table, and are there ant performance changes with or without one?

Thanks!

EDIT/Additional info: The specific example that made me ask this question is a table I am using for a many-to-many-to-many-to-many table (if we still call it that at that point) it has 4 columns (plus ID) each of which represents an ID of an external table, and each row will always be numeric and unique. only one of the columns is allowed to be null.

I understand that for normal tables an ID primary key column is a VERY good thing to have. But I get the feeling on this particular table it just wastes space and slows down adding new rows.

dev.e.loper
  • 35,446
  • 76
  • 161
  • 247
lanrat
  • 4,344
  • 10
  • 35
  • 41
  • 1
    All the arguments have been well covered by other "natural key vs surrogate" questions and yes, there are always arguments! – onedaywhen Jun 27 '11 at 06:02

8 Answers8

30

If you really do have some pre-existing column in your data set that already does uniquely identify your row - then no, there's no need for an extra ID column. The primary key however must be unique (in ALL circumstances) and cannot be empty (must be NOT NULL).

In my 20+ years of experience in database design, however, this is almost never truly the case. Most "natural" ID's that appear to be unique aren't - ultimately. US Social Security Numbers aren't guaranteed to be unique, and most other "natural" keys end up being almost unique - and that's just not good enough for a database system.

So if you really do have a proper, unique key in your data already - use it! But most of the time, it's easier and more convenient to have just a single surrogate ID that you can guarantee will be unique over all rows.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • I agree, but this begs the question, why not define a UNIQUE key on a column? Won't this guarantee its uniqueness? – yitwail Jun 27 '11 at 04:59
  • @yitwail: sure - but that doesn't give you the "primary key" aspect yet... (values could be NULL!). UNIQUE as a constraint is helpful, sure. But any table needs a primary key - and you can only have one primary key, while you can have multiple uniqueness constraints. – marc_s Jun 27 '11 at 05:01
  • another good point. UNIQUE key and column defined as NOT NULL then. :) but I know, this restricts the table from having one NULL value in the column. – yitwail Jun 27 '11 at 05:09
  • @yitwail: yes, and once you have UNIQUE and NOT NULL, you're really talking about the PRIMARY KEY already :-) – marc_s Jun 27 '11 at 05:09
  • I edited my question with a bit more information on this particular table. The "uniqueness" of each row is the combination of at least 3 columns, so it looks like an "ID" column is still a good idea? – lanrat Jun 27 '11 at 05:39
  • @mrlanrat: yes, I tend to try to avoid compound primary keys (made up of more than 1 column) - they tend to get very messy and not practical to use when you have to join from a child table (you need to specify ALL columns in your JOIN's - get annoying very quickly!). In the specific case of many-to-many link tables, that might be OK and you won't need an extra ID. But for regular tables, I'd always prefer a surrogate ID over a compound index. – marc_s Jun 27 '11 at 05:49
  • You seem to be saying that if you had a `NOT NULL` SSN in a 'person' table you wouldn't make it `UNIQUE` just because you've heard of edge cases where duplicates have arisen. What is the likelihood of a clash in SSN being a real life problem for most enterprises' databases? Extremely unlikely, I suspect. You seem to be describing artificial identifiers rather than 'surrogates', BTW. – onedaywhen Jun 27 '11 at 06:01
  • 1
    @onedaywhen: For someone like IBM or insurance, quite likely. Just search for SSN and primary key – gbn Jun 27 '11 at 06:50
  • 1
    One thing that it seems *is* guaranteed is that in any discussion about keys and integrity someone will wheel out the straw man of SSNs and begin dancing around it rather than thinking about the real issues. If marc has really been creating databases for 20 years without natural keys then I wonder how he ever expects his users identify data in his tables (or hold on to their sanity) - considering all the duplicate data they must find themselves looking at. – nvogel Jun 27 '11 at 12:22
  • 2
    @dportas: why *all the duplicate data* ?? If something (a column or set of columns) is unique, I typically put a unique constraint on it. I'm not stupid .... **BUT:** I hate to have queries from a child table to a parent table that require a gazillion JOIN conditions, just because someone decided to use a compound primary key to *avoid* the complication of a nice, simple surrogate ID as key..... – marc_s Jun 27 '11 at 12:28
  • 1
    @marc, that's what I thought you might say. So just to be clear, like me you ARE in favour of natural keys as well as surrogates and you DO regularly implement natural keys in your databases. Anyone reading your original answer will get the opposite impression. – nvogel Jun 28 '11 at 06:41
  • @gbn: again, you refer to edge cases (is the OP IBM?! :) – onedaywhen Jun 28 '11 at 08:02
  • @dportas: sigh indeed. That's why I voted to close as "Not constructive... this question will likely solicit opinion, debate, arguments, polling, or extended discussion" -- and this is comment number 13 for just one answer ;) – onedaywhen Jun 28 '11 at 08:04
  • @marc_s "But any table needs a primary key", do you mind explaining that? I'm not so sure since I can have some tables that I only need them to check things like brute force login, in this case I'd only need to count the rows matching two non-unique columns: the user id and the time, or a table to store logs and show them in a webapp or whatever, why would I need a primary/unique key in this case? Thanks, it is really a legit question, I'm genuinely interested in knowing this. – Chazy Chaz Jun 27 '18 at 05:17
  • 1
    @ChazyChaz: **any** table needs to have a means of **uniquely and reliably** identify each and every single row, in order to e.g. delete that row, establish a link between this row and other rows etc. - this is what a **primary key** is needed for. Yes, there are **a select few** edge cases where you do not necessarily need a primary key - but in general, in at least 90% of the cases, a "normal" data table just ought to have a primary key defined. – marc_s Jun 27 '18 at 05:37
7

Don't confuse the logical model with the implementation.

The logical model shows a candidate key (all columns) which could makes your primary key.

Great. However...

In practice, having a multi column primary key has downsides: it's wide, not good when clustered etc. There is plenty of information out there and in the "related" questions list on the right

So, you'd typically

  • add a surrogate key (ID column)
  • add a unique constraint to keep the other columns unique
  • the ID column will be the clustered key (can be only one per table)
  • You can make either key the primary key now

The main exception is link or many-to-many tables that link 2 ID columns: a surrogate isn't needed (unless you have a braindead ORM)

Edit, a link: "What should I choose for my primary key?"

Edit2

For many-many tables: SQL: Do you need an auto-incremental primary key for Many-Many tables?

Community
  • 1
  • 1
gbn
  • 422,506
  • 82
  • 585
  • 676
4

Yes, you could have many attributes (values) in a record (row) that you could use to make a record unique. This would be called a composite primary key.

However it will be much slower in general because the construction of the primary index will be much more expensive. The primary index is used by relational database management systems (RDBMS) not only to determine uniqueness, but also in how they order and structure records on disk.

A simple primary key of one incrementing value is generally the most performant and the easiest solution for the RDBMS to manage.

mlangsworth
  • 383
  • 3
  • 9
3

You should have one column in every table that is unique.

EDITED...

This is one of the fundamentals of database table design. It's the row identifier - the identifier identifies which row(s) are being acted upon (updated/deleted etc). Relying on column combinations that are "unique", eg (first_name, last_name, city), as your key can quickly lead to problems when two John Smiths exist, or worse when John Smith moves city and you get a collision.

In most cases, it's best to use a an artificial key that's guaranteed to be unique - like an auto increment integer. That's why they are so popular - they're needed. Commonly, the key column is simply called id, or sometimes <tablename>_id. (I prefer id)

If natural data is available that is unique and present for every row (perhaps retinal scan data for people), you can use that, but all-to-often, such data isn't available for every row.

Ideally, you should have only one unique column. That is, there should only be one key.

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • 2
    I trust you, but do you mind explaining your reasoning a bit? – lanrat Jun 27 '11 at 04:50
  • 1
    Fundamentals of database design states "a key to uniquely identify each row". Not "one column must be unique". Do you have an authoritative reference for your statement? – gbn Jun 27 '11 at 05:25
  • gee, fussy. did you actually read the rest of it? See the bit where I said "It's the row identifier". See it? Did ya? huh? – Bohemian Jun 27 '11 at 09:54
  • -1 for the "fundamentals" bit - that's just wrong as gbn has already said. -10 for "there should only be one key"!! That's appalling advice from a data integrity perspective. – nvogel Jun 27 '11 at 09:57
  • Far from appalling, it's best-practice industry standard advice - see http://en.wikipedia.org/wiki/Second_normal_form - "a 1NF table is in 2NF if and only if, given any candidate key K and any attribute A that is not a constituent of a candidate key, A depends upon the whole of K rather than just a part of it." This means there's only one key! With two keys, an attribute will depend on a column *other* than the key, violating 2NF – Bohemian Jun 28 '11 at 01:32
  • @Bohemian, you have misunderstood what 2NF means. In almost any good textbook you can find examples of relations with more than one key that are also in 2NF. I don't usually like to comment on anything found on Witless-pedia but in this case the quotation you mentioned: "given any candidate key K ... A depends upon the whole of K" certainly does not exclude the possibility of there being more than one K. If A depends on the whole of K then it can still be dependent on other keys as well. – nvogel Jun 29 '11 at 05:28
3

Using IDs to key tables means you can change the content as needed without having to repoint things

Ex. if every row points to a unique user, what would happen if he/she changed his name to let say John Blblblbe which had already been in db? And then again, what would happen if you software wants to pick up John Blblblbe's details, whose details would be picked up? the old John's or the one ho has changed his name? Well if answer for bot questions is 'nothing special gonna happen' then, yep, you don't really need "ID" column :]

Important:

Also, having a numeric ID column with numbers is much more faster when you're looking for an exact row even when the table hasn't got any indexing keys or have more than one unique

WooDzu
  • 4,771
  • 6
  • 31
  • 61
  • 2
    That isn't the reason to have one... this is a consequence of having ID. A primary key shouldn't change and OP has a candidate PK, so this isn't why you'd have one. – gbn Jun 27 '11 at 04:55
  • 1
    After an update: without an index, access to the row will be slower because each row will be wider. The whole table needs read too. This will offset any comparison speed gains per row. – gbn Jun 27 '11 at 05:23
  • What I meant was what queries like **WHERE not_indexed_column = 175**, are faster than **WHERE not_indexed_column = 'blablabla'** – WooDzu Jun 27 '11 at 05:32
  • @WooDzu: and that's exactly what I meant in my comment. This is wrong. – gbn Jun 27 '11 at 05:35
  • how comparison by numerical values can not be faster than comparison by chars, even though db read all table it only tests the one column? where can I be wrong? – WooDzu Jun 27 '11 at 05:46
  • 1
    @WooDzu: reading the whole table takes longer because it is bigger. – gbn Jun 27 '11 at 06:48
1

If you are sure that any other column is going to have unique data for every row and isn't going to have NULL at any time then there is no need of separate ID column to distinguish each row from others, you can make that existing column primary key for your table.

love Computer science
  • 1,824
  • 4
  • 20
  • 39
0

An ID can be more meaningful, for an example an employee id can represent from which department he is, year of he join and so on. Apart from that RDBMS supports lots operations with ID's.

Abhishek Gurjar
  • 7,426
  • 10
  • 37
  • 45
Afz Al
  • 11
  • 3
0

No, single-attribute keys are not essential and nor are surrogate keys. Keys should have as many attributes as are necessary for data integrity: to ensure that uniqueness is maintained, to represent accurately the universe of discourse and to allow users to identify the data of interest to them. If you have already identified a suitable key and if you don't find any real need to create another one then it would make no sense to add redundant attributes and indexes to your table.

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • Good luck with databases where every table has compound keys with 4, 5, 10 attributes.... I wouldn't want to see (or worse: **maintain!**) those queries! .... – marc_s Jun 27 '11 at 12:25