
Here's my languages table definition with a natural primary key (the DBMS is MySQL):

DROP TABLE IF EXISTS languages;
CREATE TABLE IF NOT EXISTS languages (
    language       VARCHAR(16) NOT NULL,
    PRIMARY KEY    (language)
) ENGINE=InnoDB;

Here's another version of it, with an auto-increment surrogate key and the UNIQUE constraint applied to the language column:

DROP TABLE IF EXISTS languages;
CREATE TABLE IF NOT EXISTS languages (
    language_id    TINYINT     NOT NULL AUTO_INCREMENT,
    language       VARCHAR(16) NOT NULL,
    PRIMARY KEY    (language_id),
    UNIQUE         (language)
) ENGINE=InnoDB;

I'm in two minds about which version is better to use. On one hand, the first definition seems to be the correct one according to database design theory, simply because there is no extra junk in it, and the PRIMARY KEY constraint guarantees that there cannot be two rows with the same value; that is, there is no way the word 'English' can appear twice in the column, which is, of course, a good thing. The problem is that a foreign key column in another table that references the language column has to store strings instead of ID numbers. That means the referencing table stores the whole string in every row, and if the application can provide a drop-down combobox pre-populated with the unique values, having the languages table seems to serve no purpose. But, theoretically speaking, the first way is still more correct.
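To make the trade-off concrete, here is a sketch of what a referencing table could look like under the first design (the person table and its columns are hypothetical, purely for illustration):

```sql
-- Hypothetical referencing table under the first (natural key) design:
-- the foreign key column stores the language string itself.
CREATE TABLE person (
    person_id   INT         NOT NULL AUTO_INCREMENT,
    name        VARCHAR(64) NOT NULL,
    language    VARCHAR(16) NOT NULL,
    PRIMARY KEY (person_id),
    FOREIGN KEY (language) REFERENCES languages (language)
        ON UPDATE CASCADE
) ENGINE=InnoDB;
```

Every row of person repeats the full language string, and ON UPDATE CASCADE is needed so that renaming a language propagates to the referencing rows.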

On the other hand, the second approach sounds more practical. For ensuring uniqueness we can use the UNIQUE constraint, and we have integers instead of strings in the referencing columns, which tend to take up less memory and, as far as I know, are much faster to search than strings.
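The same hypothetical referencing table under the second design would store only the small integer key (again, the person table is just for illustration):

```sql
-- Hypothetical referencing table under the second (surrogate key) design:
-- the foreign key column stores a one-byte TINYINT instead of the string.
CREATE TABLE person (
    person_id    INT         NOT NULL AUTO_INCREMENT,
    name         VARCHAR(64) NOT NULL,
    language_id  TINYINT     NOT NULL,
    PRIMARY KEY (person_id),
    FOREIGN KEY (language_id) REFERENCES languages (language_id)
) ENGINE=InnoDB;
```

Here no cascade is needed for renames, since the referencing rows never change when the language string does.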

Please, help me get this straight.

Mikhail
  • You already got this straight. IMHO it's a matter of preference in schema design. Usually you use one way or the other consistently across your whole schema. Therefore IMHO the question is whether you are ready to live with natural keys across your whole schema. – peterm Apr 10 '13 at 19:49
    Imagine you make a typo when entering the language name (e.g. "Germain" instead of "German"). Now in the first case you will have to fix that in **both** the languages table and the **tables that refer to it** via FKs. In the second case you only need to make one correction. – wildplasser Apr 10 '13 at 19:52
  • With the first approach you can add the ON UPDATE CASCADE clause that will take care of it. – Mikhail Apr 10 '13 at 19:59
  • Yes, it could, but that would cascade into 50M rows in the person table, where there is a "preferred_language" field with a foreign key to this table. – wildplasser Apr 10 '13 at 20:55
  • I don't get your point. Wouldn't it still cascade through the same 50 million rows in a table referencing the languages table had you used the second approach? – Mikhail Apr 10 '13 at 21:52
  • http://stackoverflow.com/questions/3747730/relational-database-design-question-surrogate-key-or-natural-key – Andrew Apr 10 '13 at 23:01

2 Answers


I've asked a similar question here: Should an SQL Dictionary table have an IDENTITY column

In that case I found that not having the ID column was the correct decision, as there would never be a case when I'd reference the data by anything other than the PK in code. That is, there are no foreign keys reliant on that table.

In the case where you are looking up some arbitrary piece of data or referencing it as a foreign key, I'd always argue in favor of using the id column, as it will reduce the size of the database and is immediately identifiable as a foreign key to anyone with even the most basic database knowledge.

Colton

The second version is more normalized. In database design theory there is the concept of the normal forms, from 1NF (first normal form) through 2NF up to 6NF. 1NF means you have some kind of key; 6NF means your data is structured as cleanly as it can get. High normalization sounds nice, but you pay a price:

  • more complex query and insert operations
  • decreasing performance due to all this complexity

In case of doubt I would always go for the less complex option. If you really need full optimization or normalization one day, you can change your schema on that day. I'm not sure how large your database is, but the refactoring will probably be a piece of cake if you do it carefully.
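If you do start with the natural-key version and later decide to add a surrogate key, a migration could look roughly like this (a sketch only; the exact steps depend on your data volume and on any existing foreign keys, which must be dropped first and re-created against language_id afterwards):

```sql
-- Sketch: retrofitting a surrogate key onto the natural-key version.
-- MySQL requires an AUTO_INCREMENT column to be a key, hence UNIQUE here.
ALTER TABLE languages
    ADD COLUMN language_id TINYINT NOT NULL AUTO_INCREMENT UNIQUE FIRST;

-- Then swap the primary key over and keep uniqueness on the name.
ALTER TABLE languages
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (language_id),
    ADD UNIQUE (language);
```

The column is populated with incrementing numbers automatically when it is added.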

Philip
  • Probably for calling refactoring a piece of cake? I don't know; your response up to that part is fairly accurate. – Colton Apr 10 '13 at 21:22
  • Well, it seems to me they are identical in terms of normalization. Don't you think so? You said the second version was more normalized. I'd like to see in what respect it is so. – Mikhail Apr 10 '13 at 21:32
  • @Sparksis: Adding a column is no magic. Neither is populating that column with incrementing numbers. Even if you are not fluent with MySQL, googling that takes 20 minutes. Updating the code is also a no-brainer, given that the OP sounds like he doesn't use an ORM. In fact, updating the code later would just be doing the extra effort he could do now. – Philip Apr 11 '13 at 11:55
  • @MikhailRybkin: As pointed out in the comments above, the key is (ab)used as a data container. Even though the OP didn't clarify that, it seems obvious to me that he will show the key in the UI. An example where this affects normalization: you realize that you should write Français instead of Francais. Having it in an extra column means you don't have to update columns in other tables. ... – Philip Apr 11 '13 at 12:00