Why is it not good to have a primary key on a join table?

Question

I was watching a screencast where the author said it is not good to have a primary key on a join table but didn't explain why.

The join table in the example had two columns defined in a Rails migration and the author added an index to each of the columns but no primary key.

Why is it not good to have a primary key in this example?

create_table :categories_posts, :id => false do |t|
  t.column :category_id, :integer, :null => false
  t.column :post_id, :integer, :null => false
end
add_index :categories_posts, :category_id
add_index :categories_posts, :post_id

EDIT: As I mentioned to Cletus, I can understand the potential usefulness of an auto number field as a primary key even for a join table. However in the example I listed above, the author explicitly avoids creating an auto number field with the syntax ":id => false" in the "create table" statement. Normally Rails would automatically add an auto-number id field to a table created in a migration like this and this would become the primary key. But for this join table, the author specifically prevented it. I wasn't sure why he decided to follow this approach.

To editors: it may be important to emphasize the context of this question. It is more often than not bad form to NOT have a primary key. — Mark Canlas, May 19 '09 at 01:12
Good article is Codd's 1970 paper http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf — Matt Rogish, May 19 '09 at 14:37
One thing to consider, that paper was written in 1970, when I/O and data storage was relatively much, much more expensive. In modern times, however, the costs of adding an additional primary key column is almost always minuscule. I'd love to see someone present a real world case where the extra column creates a measurable problem. — DGM, Jan 02 '18 at 22:07

Matt Rogish · Accepted Answer · 2009-05-19T20:16:37.747

Some notes:

The combination of category_id and post_id is unique in of itself, so an additional ID column is redundant and wasteful
The phrase "not good to have a primary key" is incorrect in the screencast. You still have a Primary Key -- it is just made up of the two columns (e.g. CREATE TABLE foo( cid, pid, PRIMARY KEY( cid, pid ) ). For people who are used to tacking on ID values everywhere this may seem odd but in relational theory it is quite correct and natural; the screencast author would better have said it is "not good to have an implicit integer attribute called 'ID' as the primary key".
It is redundant to have the extra column because you will place a unique index on the combination of category_id and post_id anyway to ensure no duplicate rows are inserted
Finally, although common nomenclature is to call it a "composite key" this is also redundant. The term "key" in relational theory is actually the set of zero or more attributes that uniquely identify the row, so it is fine to say that the primary key is category_id, post_id
Place the MOST SELECTIVE column FIRST in the primary key declaration. A discussion of the construction of b(+/*) trees is out of the scope of this answer ( for some lower-level discussion see: http://www.akadia.com/services/ora_index_selectivity.html ) but in your case, you'd probably want it on post_id, category_id since post_id will show up less often in the table and thus make the index more useful. Of course, since the table is so small and the index will be, essentially, the data rows, this is not very important. It would be in broader cases where the table is wider.

This explanation hit all of the points of unclarity that I was getting stuck on. Thanks. "the screencast author would better have said it is 'not good to have an implicit integer attribute called 'ID' as the primary key'" And thanks for spelling this out: "It is redundant to have the extra column because you will place a unique index on the combination of category_id and post_id anyway to ensure no duplicate rows are inserted" — pez_dispenser, May 19 '09 at 02:19
It is not correct to say zero or more (basic set theory aside) see: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf — Matt Rogish, May 19 '09 at 14:36
OK it IS just fine in relational theory (empty set as a key) but NOT in SQL — Matt Rogish, May 19 '09 at 20:20
A 6th reason would be that having a synthetic primary key will cause a (slight) performance penalty for writes to the table. This is because you'll have to generate the next key and update another index with each insert into the table, but since it's highly unlikely you'll ever reference the synthetic key you won't be gaining any speed-up to reads for having the index. To gain nothing for something is a bad trade-off. — neuronaut, Jul 06 '15 at 22:34

score 4 · Answer 2 · answered May 19 '09 at 01:42

4

It is a bad idea not to have a primary key on any table, period (if the DBMS is a relational DBMS - or an SQL DBMS). Primary keys are a crucial part of the integrity of your database.

I suppose if you don't mind your database being inaccurate and providing incorrect answers every so often, then you could do without...but most people want accurate answers from their DBMS and for such people, primary keys are crucial.

answered May 19 '09 at 01:42

Jonathan Leffler

730,956
141
904
1,278

1

What kind of inaccurate data would a primary key prevent in this join table consisting of two columns? I don't doubt that you may be right but I just don't understand what kind of invalid data might be possible here. Could you give me an example? – pez_dispenser May 19 '09 at 01:49
Duplicate rows is what I had in mind. You can find other examples, too, especially with liberally strewn identity columns and no unique constraint on the natural keys. If the table has other columns than just the key columns, you can actually have the database containing contradictions - and if you know classic logic, you know that arguing from a contradiction leads to fallacies. – Jonathan Leffler May 19 '09 at 02:40

score 3 · Answer 3 · answered May 19 '09 at 01:05

3

A DBA would tell you that the primary key in this case is actually the combination of the two FK columns. Since Rails/ActiveRecord doesn't play nice with composite PKs (by default, at least), that may be the reason.

answered May 19 '09 at 01:05

Hank Gay

70,339
36
160
222

When you say "by default" you mean there is a way to get Rails to play nice but it's complicated to do? – pez_dispenser May 19 '09 at 01:09
q-tip: Take a look at has_and_belongs_to_many (along with has_many_through) http://blog.hasmanythrough.com/2007/1/15/basic-rails-association-cardinality - it will magically (sorta) take advantage of the join table – Matt Rogish May 19 '09 at 03:52
@po I've *heard* that there are plugins to make AR work with natural keys (including composite keys) instead of surrogate keys. I don't know whether they work or not. – Hank Gay May 19 '09 at 08:06

score 3 · Answer 4 · answered May 19 '09 at 01:11

3

The combination of foreign keys can be a primary key (called a composite primary key). Personally I favour using a technical primary key instead of that (auto number field, sequence, etc). Why? Well, it makes it much easier to identify the record, which you may need to do if you're going to delete it.

Think about it: if you're going to present a Webpage of all the linkages, having a primary key to identify the record makes it much easier.

answered May 19 '09 at 01:11

cletus

616,129
168
910
942

I see what you mean about the usefulness of an auto number field as a primary key even for a join table. However in the example I listed above, the author explicitly avoids creating an auto number field with the syntax ":id => false" in the create table statement. Normally Rails would automatically add an auto-number id field to a table created in a migration like this and this would become the primary key. But for this join table, the author specifically prevented it. I wasn't sure why he decided to follow this approach. – pez_dispenser May 19 '09 at 01:23

score 3 · Answer 5 · answered May 19 '09 at 01:40

3

Basically because there's no need for it. The combination of the two foreign key field adequately uniquely identifies any row.

But that merely says why it's not a Good Idea.... but why would it be a Bad Idea?

Consider the overhead adding a identity column would add. The table would take up 50% more disk space. Worse is the index situation. With a identity field, you have to maintain the identity count, plus a second index. You'll be tripling the disk space and tripling the work the needs to be performed on every insert. With the only advantage being a slightly shorter WHERE clause in a DELETE command.

On the other hand, If the composite key fields are the entire table, then the index can be the table.

answered May 19 '09 at 01:40

James Curran

101,701
37
181
258

1

All primary keys are not identity columns! – Jonathan Leffler May 19 '09 at 01:42
Sorry. I was following you up until the last sentence. I didn't understand what you meant by: "On the other hand, If the composite key fields are the entire table, then the index can be the table." – pez_dispenser May 19 '09 at 01:45
q-tip: There's no advantage to adding an index that is equal to the entire table, like would happen if your primary key was a composite key made up of every column in the table. When we do this, the table serves as the index. I hope that's useful. Sorry if it isn't. – Lucas Wilson-Richter May 19 '09 at 02:10
2

Some DBMS support the concept of index-only tables. These are useful where, as James said, all the columns in the table make up the primary key (for example, a joining table) and when there is no need for an index on the second column (in the index-only table). – Jonathan Leffler May 19 '09 at 02:43

score 2 · Answer 6 · answered May 19 '09 at 10:34

2

Placing the most selective column first should only be relevant in the INDEX declaration. In the KEY declaration, it should not matter (because, as has been correctly pointed out, the KEY is a SET, and inside a set, order doesn't matter - the set {a1,a2} is the same set as {a2,a1}).

If a DBMS product is such that ordering of attributes inside a KEY declaration makes a difference, then that DBMS product is guilty of not properly distinguishing between the logical design of a database (the part where you do the KEY declaration) and the physical design of the database (the part where you do the INDEX declaration).

answered May 19 '09 at 10:34

Most, if not all, DBMS products I've used (MySQL, Sybase ASE, SQL Server, Oracle) implicitly create a unique index on the PRIMARY KEY declaration in the order you specify. Yes, it is violating logical/physical independence but is the only way to go (unless you create the table w/o a primary key, create a unique index, then 'mark' the primary key somehow) – Matt Rogish May 19 '09 at 14:30
Also, SQL by definition violates a TON of the Relational Model, this included :D – Matt Rogish May 19 '09 at 14:30

score 2 · Answer 7 · answered May 19 '09 at 17:43

I wanted to comment on the following comment : "It is not correct to say zero or more".

I wanted to remark that the text to which this comment was added simply did not contain the text "zero or more", so the author of the comment I wanted to comment on was criticizing someone else for something that hadn't been said.

I also wanted to comment that it is not correct to say that it is not correct say "zero or more". Relational theory as commonly known today among the few people who still bother to study the details of that theory, actually REQUIRES the possibility of a key with no attributes.

But when I pressed the button "comment", the system responded to me that commenting requires a reputation score of 50 (or some such).

A sad illustration of how the world seems to have forgotten that science is not democracy, and that in science, the truth is not determined by whoever happens to be the majority, nor by whoever happens to have "enough reputation".

I see - re-reading Date's Database Dictionary says that an empty PK is used to constrain relvars to a single row. OK, I buy that -- it's not explicit in Codd's writing but other than that limited edge case, when would someone use an empty key? — Matt Rogish, May 19 '09 at 20:43

score 1 · Answer 8 · answered May 19 '09 at 01:57

Pros of having a single PK

Uniquely identifies a row with a single value
Makes it easy to reference the relationship from elsewhere if needed
Some tools want you to have a single integer value pk

Cons of having a single PK

Uses more disk space
Need 3 indexes rather than 1
Without a unique constraint you could end up with multiple rows for the same relationship

Notes

You need to define a unique constraint if you want to avoid duplicates
In my opinion don't use the single pk if you're table is going to be huge, otherwise trade off some disk space for the convenience. Yes it's wasteful, but who cares about a few MB on disk in real world applications.

"Need 3 indexes rather than 1" - I assume you mean the primary key on an auto number field plus the two other indexes in my example above (not two additional indexes not listed in my example). "Without a unique constraint you could end up with multiple rows for the same relationship" -> So in that case the PK would have to reference the two columns in this join table. In other words, a primary key that simply consists of an auto number field would not work. Hope I'm understanding. — pez_dispenser, May 19 '09 at 02:10

Why is it not good to have a primary key on a join table?

8 Answers8