0

Setup:

I was trying to understand the difference between identifying and non-identifying relationships when I found this great article on stackexchange. What's the difference between identifying and non-identifying relationships?

After reading a few comments it brought another question to mind about a problem I have been having.


Question:

Should I use multi-column primary keys on every child table and what are the advantages/disadvantages to doing so?

To better illustrate my question I have created an example below. I also included the comments that caused me to ask this question.


Example:

In my situation, I know the building_id and I need to get bed.data.

#1 - My current DB structure:

TABLE { FIELDS }
-----------------------------------------------------------------------
building { id, data } 
floor { id, building_id, data }
room {id, floor_id, data }
bed {id, room_id, data }

This type of table structure would require me to use a few joins to get the data I need. Not a big deal but kind of a pain since I run into this situation a lot.

#2 - My interpretation of Bill Karwin's suggested DB structure (see article comments below):

TABLE { FIELDS }
-----------------------------------------------------------------------
building { id, data } 
floor { id, building_id, data }
room {id, building_id, floor_id, data }
bed {id, building_id, floor_id, room_id, data }

This table structure seems to eliminate the need for joins in my situation. So what are the disadvantages to this table structure? I really like the idea of not doing so many join statements.


Comments From Article:

What's the difference between identifying and non-identifying relationships?

@hobodave: It's the "convention over configuration" argument. Some schools of thought are that every table should define its primary key for a single-column pseudokey named id that auto-generates its values. Application frameworks like Rails have popularized this as a default. They treat natural keys and multi-column keys as divergent from their conventions, needed when using "legacy" databases. Many other frameworks have followed this lead. – Bill Karwin Mar 10 '10 at 23:06

It seems like "properly" constructing identifying relationships would lead to obnoxiously huge primary keys. e.g. Building has Floor has Room has Bed. The PK for Bed would be (bed_id, floor_id, room_id, building_id). It seem's strange that I've never seen this in practice, nor heard it suggested as a way to do anything. That's a lot of redundant data in the PK. – hobodave Mar 10 '10 at 23:34

@hobodave: I have seen multi-column primary keys that are even larger. But I take your point. Consider that multi-column primary keys convey more information; you can query the Beds table for all beds in a specific building without doing any joins. – Bill Karwin Mar 11 '10 at 1:00

Community
  • 1
  • 1
zechdc
  • 3,374
  • 9
  • 40
  • 52
  • One thing to consider. Is the building ID field something that could change? (In which case, your update code would need to update all tables using the the building_id field). If Building ID is an increment or identity field, it shouldn't change, so the only draw back is keeping extra fields makes tables larger. – Sparky Sep 08 '11 at 20:56
  • I try to keep both FK columns named the same, so the table building would be: `building_id, data` and floor table would be: `floor_id, building_id, data`. When you work with large systems developed by many people over a long time, it is way easier to read and write queries where you join on the same names. Not to mention your results sets wont contain several `id` columns that you have to alias each time. – KM. Sep 08 '11 at 20:59
  • 1
    Maybe we can get @Bill Karwin to join us and point out whether we've misunderstood something. – Mike Sherrill 'Cat Recall' Sep 09 '11 at 01:15
  • That would be nice. Is there a way to do that? – zechdc Sep 09 '11 at 01:42

3 Answers3

3

This data is normalized

TABLE { FIELDS }
-----------------------------------------------------------------------
building { id, data } 
floor { id, building_id, data }
room {id, floor_id, data }
bed {id, room_id, data }

This table is not (bad idea)

TABLE { FIELDS }
-----------------------------------------------------------------------
building { id, data } 
floor { id, building_id, data }
room {id, building_id, floor_id, data }
bed {id, building_id, floor_id, room_id, data }
  1. In the first (good) table you do not have unneeded duplicated data.
  2. Inserts in the first table will be much faster.
  3. The first tables will fit more easily in memory, speeding up your queries.
  4. InnoDB is optimized with model A in mind, not with model B.
  5. The latter (bad) table has duplicated data, if that gets out of sync, you will have a mess. DB A cannot is much harder to get out of sync, because the data is only listed once.
  6. If I want to combine data from the building, floor, room and bed I will need to combine all four tables in model A as well as model B, how are you saving time here.
  7. InnoDB stores indexed data in its own file, if you select only indexes, the tables themselves will never be accessed. So why are you duplicating the indexes? MySQL will never need to read the main table anyway.
  8. InnoDB stores the PK in each an every secondary index, with a composite and thus long PK, you are slowing down every select that uses an index and balooning the filesize; for no gain what so ever.
  9. Do you have serious speed problem? If not, you are you denormalizing your tables?
  10. Don't even think about using MyISAM which suffers less from these issues, it is not optimized for multi-join databases and does not support referential intregrity or transactions and is a poor match for this workload.
  11. When using a composite key you can only ever use the rightmost-part of the key, i.e. you cannot use floor_id in table bed other than using id+building_id+floor_id, This means that you may have to use much more key-space than needed in Model A. Either that or you need to add an extra index (which will drag around a full copy of the PK).

In short
I see absolutly zero benefit and a whole lot of drawbacks in Model B, never use it!

Johan
  • 74,508
  • 24
  • 191
  • 319
  • Ok, makes sense. Say we add a new table to the mix: "User { id }". Would it be reasonable to follow model A like you suggested but still include "user_id" in almost every table for quick access to verify that a user has the right to access a given "bed", "floor", or "building"? I know this is kinda a broad question, but in my live system I have a mixture of model A and B. Namely, I have a user table whos ID can be found in almost every table for exactly the reason I described. – zechdc Sep 08 '11 at 21:58
  • 2
    @zechdc, I would advice you NOT to denormalize UNTIL slowness sets in AND you have run out of options, so without further info to the contrary I would say `stick to model A`. The idea you descibe sounds very wasteful, what if 2 users have access to the same row, do you double the row? – Johan Sep 08 '11 at 23:27
  • thanks! I think you just answered another question I had... how do you allow multiple users to access the same data? Thanks! – zechdc Sep 09 '11 at 00:00
  • 1
    The OP's model #2 is incorrect. See my interpretation of Bill Karwin's other SO answer below. – Mike Sherrill 'Cat Recall' Sep 09 '11 at 01:07
3

I think it's pretty unlikely that your #2 is what Bill Karwin meant. Usually, "id" implies an automatic numeric sequence. I think it's more likely he meant something along these lines. Columns that make up primary keys are between asterisks.

TABLE    { COLUMNS }
-----------------------------------------------------------------------
building { *building_id*, other columns } 
floor    { *building_id, floor_num*, other columns }
room     { *building_id, floor_num, room_num*, other columns }
bed      { *building_id, floor_num, room_num, bed_num* (?), other columns }

I'm not sure what other columns you might have for "bed", though. Twin, Full, Queen, King? That could make sense. If that's the case, then this table

bed      { *building_id, floor_num, room_num, bed_num*, bed_size }

is far from "denormalized". In fact, it's in 5NF.

If you test the performance of these two schemas, you'll probably find that this one runs rings around your #1 most of the time. In the batch of queries I ran, it's about 30 times faster.

Johan
  • 74,508
  • 24
  • 191
  • 319
Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • That makes sense. I did misinterpret what he was saying but luckily I got the answer I was looking for. Thanks for your help. – zechdc Sep 09 '11 at 00:52
  • 2
    It would probably make more sense--and be better practice for you--if you tested normalization and performance before you accept an answer. Wikipedia's articles on normalization and the normal forms are not bad. – Mike Sherrill 'Cat Recall' Sep 09 '11 at 01:10
  • I have been practicing normalization for quite some time, sometimes my brain just shuts down and I ask silly questions like this. However, I have gleaned some great information from both your answers that I would have otherwise not received. They both helped me arrive at some final decisions on my current project. I voted yours up but I do prefer Johan's answer because it answered a question I was unable to articulate. You are correct, I did misunderstand Bill's response. Thanks for clearing that up. – zechdc Sep 09 '11 at 01:49
  • This assumes that the *_nums repeat, so that you can keep the keys small. – Johan Sep 09 '11 at 10:11
  • 2
    @Johan: It assumes you identify a bed by building, floor, room, and bed number, which was given in the problem, but which follows how hotels and motels work. It doesn't matter whether any of those numbers repeat. (It doesn't matter if you have only one building that has only one floor, only one room, and only one bed. Although you probably wouldn't need a database in that case.) – Mike Sherrill 'Cat Recall' Sep 09 '11 at 11:49
0

The first tables structure is normalized,classical structure. But unfortunately this one is not applicable for the big project. Because if your table building contains many data rows e.g. million depending on which cities or countries you use your join will be very slow. So in real projects denormalized tables are used which contain all aggregated info. You can work with such tables directly or use standalone servers like sphinx for searching data. Regarding primary key on three fields I think in this case this one is redundant. Because

  1. If you use innodb this key will be added to all secondary keys in this table.
  2. If you use interface for managing beds it will be convenient to use one field id for working with a specific rows than with three fields.
  3. If you want to guarantee uniqueness of row you can use UNIQUE KEY on these 3 fields.
Andrej
  • 7,474
  • 1
  • 19
  • 21
  • 2
    Strange, people with terrabyte sized databases don't denormalize to get rid of joins. Of course they have designers who know what they are doing and do proper indexing. – HLGEM Sep 08 '11 at 21:13
  • 1
    @HLGEM it seems that you didn't take part in big project which search by hierarchical data. Simple example you have addresses objects like cities,districts, streets, buildings and so on and use mysql for this one(not sure that this one optimal solution) and you provide to users suggest box for quick searching inside objects. So try to implement this one using joining tables :) For example, we use aggregated table of all address objects and put them into sphinx for quick search. – Andrej Sep 08 '11 at 21:20