5

When storing a user's religion in a "User Table", so that if you look down a column you would see "Christian" many times, "Muslim" many times, etc considered a failure of a normal form? Which form?

The way I see it:

  • 1nf: There are no repeating columns.

  • 2nf: There is no concatenated primary key, so this does not apply.

  • 3nf: There is no dependency on a nonkey attribute.

Storing user religion this way does not seem to fail any normal form, however it seems very inefficient. Comments?

philipxy
  • 14,867
  • 6
  • 39
  • 83
user
  • 533
  • 7
  • 16

5 Answers5

6

Your design supports all normal forms. It's fine that your attribute has a string value. The size of the data type is irrelevant for normalization.

The goal of normalization is not physical storage efficiency -- the goal is to prevent anomalies. And to support logical efficiency, i.e. store a given fact only once. In this case, the fact that the user on a given row is Christian.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • How are data modification anomalies actually prevented without some constraint (CHECK or FOREIGN KEY) in this model? (Ignoring any client code enforcement via an ORM or such) – gbn Dec 30 '11 at 05:40
  • That is a great point you bring up. A CHECK could be put in but there would be no source of data modification. You could ask the same question in regards to any atomic data value. I don't believe preventing data modification anomalies is always necessary. Feel free to disagree. – user Dec 30 '11 at 05:50
  • @gbn, can you describe an example of the kind of anomaly you would want to be prevented? If you want to restrict the set of valid strings, you can use a constraint as you say. But rules of normalization still don't require using an integer pseudokey; you can put a constraint directly on a column with a string data type. – Bill Karwin Dec 30 '11 at 06:06
  • @BillKarwin: "Christain" vs "Christian" for example, or adding "Jedi" to the list of possible religions. As it stands, update/insert/delete anomalies are not prevented currently with the information given. I never mentioned "thou shalt always have a surrogate key" see [One](http://stackoverflow.com/a/790359/27535) [Two](http://stackoverflow.com/a/1208139/27535) – gbn Dec 30 '11 at 06:14
  • @gbn: Those are examples of choosing an unwanted value, not an *anomaly*. An anomaly example would be: "user1 goes to church X, and X is a Christian church, whereas user2 also goes to the same church X, which is a Jedi church." – Bill Karwin Dec 30 '11 at 08:04
  • @BillKarwin: Would an efficent way to store these types of user attributes be to simply list them in a user table? I was thinking an alternate, effective way to store attributes such as Religion, Eye Color, Hair Color, etc would be to make an Attribute table. Does this seem like a more effective approach? – user Dec 30 '11 at 10:20
  • @user1122200: In a relational design, you should store each attribute in its own column in the user table, or in a child table if it's a multi-valued attribute. If by an "Attribute table" you mean [Entity-Attribute-Value](http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model), then no, [I never recommend that design](http://karwin.blogspot.com/2009/05/eav-fail.html). It's both inefficient and non-relational. – Bill Karwin Dec 30 '11 at 17:01
  • @BillKarwin: Great. One problem I see with this method is in adding a new user attribute after deployment. The new attribute column would store a lot of blank spaces if users have not entered in the values for that attribute. However in using the EAV or similar "Attribute table" model, that would not be a problem. Your thoughts? – user Dec 30 '11 at 20:23
  • A NULL is not the same as a string of blank spaces, and NULL should be stored efficiently by any reasonably modern RDBMS. InnoDB for example uses no storage for a NULL, it just skips that column. – Bill Karwin Dec 30 '11 at 20:26
  • Ah, how great if StackOverflow had a whiteboard where I could draw what I mean. Ok, if that column is only partially filled, there would be multiple NULLs as I understand what you're saying. Any RDBMS would be fine with that. On that same note then, what prevents database design that fails 1NF? If a user say has hobby1, hobby2, hobby3, etc, why not just throw in hobby1...hobby10 for every user, knowing that some users will have 3 hobbies, some may have 5, etc? It seems to me that the unfilled cells would be inefficient. – user Dec 30 '11 at 20:34
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/6267/discussion-between-user1122200-and-bill-karwin) – user Dec 30 '11 at 20:43
4

The principle disadvantage to storing the column in that manner is in storage space as the number of rows scales up.

Rather than a character column, you could use an ENUM() if you have a fixed set of choices that will rarely, if ever, change, and still avoid creating an additional table of religion options to which this one has a foreign key. However, if the choices will be fluid, normalization rules would prefer that the choices be placed into their own table with a foreign key in your user table.

There are other advantages besides storage space to keeping them in another table. Modifying them is a snap. To change Christian to Christianity, you can make a single change in the religions table, rather than doing the potentially expensive (if you have lots of rows and religion is not indexed)

UPDATE users SET religion='Christianity' WHERE religion='Christian'

... you can do the much simpler and cheaper

UPDATE religions SET name='Christianity' WHERE id=123

Of course, you also enforce data integrity by keying against a religions table. It becomes impossible to insert an invalid value like the misspelled Christain.

Michael Berkowski
  • 267,341
  • 46
  • 444
  • 390
  • 1
    Great insight, but the question stands. You say "denormalized manner". Which normal form failure applies here? 1nf, 2nf or 3nf? – user Dec 30 '11 at 03:58
  • 1
    @user1122200 I think it is not strictly a violation of any of the first 3 NF, but will scale inefficiently. – Michael Berkowski Dec 30 '11 at 04:12
  • That is what I am asking. It does not seem to be a normal form failure at all. I'm not concerned about scaling, just about the normal form. It seems ugly and inefficient to store it like that, but it would be correctly normalized it seems. – user Dec 30 '11 at 04:15
  • 1
    @user1122200 Definitely hold out for an answer from someone with a very strong theoretical background. You may get a more nuanced interpretation. – Michael Berkowski Dec 30 '11 at 04:19
  • In designing a database, I see this problem come up frequently with many entity attributes. It does not seem to fail a normal form, but leaving it like it is will lead to a lot of redundancy. Is there a SQL server equivalent to enum? – user Dec 30 '11 at 04:27
  • @user1122200 Sorry I thought I saw a MySQL tag on this earlier and that's why I suggested the ENUM. SQL Server doesn't have enum, and it's done instead by storing the full value with check constraints. See http://stackoverflow.com/questions/1434298/sql-server-equivalent-to-mysql-enum-data-type – Michael Berkowski Dec 30 '11 at 04:30
1

I'm assuming that there's a list of valid religions; if you've just got the user entering their own string, then you have to store it in the user table and this is all moot.

Assume that religions are stored in their own table. If you're following well-established practices, this table will have a primary key which is an integer, and all references to entries in the table in other tables (such as the user table) will be foreign keys. The string method of storing religion doesn't violate any normal form (since the name of a religion is a candidate key for the religion table), but it does violate the practice of not using strings as keys.

(This is an interesting difference between the theory and practice of relational algebra. In theory, a string is no different from an integer; they're both atomic mathematical values. In practice, strings have a lot of overhead that leads programmers not to use them as keys.)

Of course, there are other ways (such as ENUM for some RDBMSes) of storing a list of possible values, each with their own advantages and disadvantages.

Taymon
  • 24,950
  • 9
  • 62
  • 84
  • "but it does violate the practice of not using strings as keys" The string would not be the key. If these were moved to a separate table, I would make a ReligionID that is numerical – user Dec 30 '11 at 05:15
  • The main main point here is efficiency. Adding an extra table would mean an extra join, which is fine, but not when this strategy is used for multiple attributes. That could lead to multiple joins just to pull up the information about a user. Too many joins. And if this method does not violate a normal form, I see no reason to add an extra table. – user Dec 30 '11 at 05:52
0

Your normal forms are a little awry. Second normal form is that the rest of the row depends on "the whole key". Third normal form is that the rest of the row depends on "nothing but the key." (So help me Codd).

No, your situation as described does not violate any of the first three normal forms. (It might violate the sixth, depending on other factors).

Burleigh Bear
  • 3,274
  • 22
  • 32
  • The normal forms as stated are fine. Your version restates the same thing. "The whole key" refers to both parts of a concatenated primary key. "Nothing but the key" refers to a nonkey attribute depending on another nonkey attribute. – user Dec 30 '11 at 05:14
  • Ah, sorry, I thought you meant that second normal form outlawed composite primary keys. My bad. I've not heard 'concatenated primary key' as a phrase, so just guessed at its meaning. – Burleigh Bear Dec 30 '11 at 05:38
0

There are a few cons with this approach (compared to using a foreign key) that you will need to make sure you are ok with. 1 - wastes storage. 2 - slower to query by religion 3 - someone might put data in there that doesn't match, eg manually insert "Jedi" or something that you might not consider correct 4 - there's no way to have a list of possible religions (eg if there are no one of a certain religion in your table, eg, Zoroastrian) but you still want it to be a valid possibility 5 - incorrect capitalization might cause problems 6 - white space around the string might cause problems

The main pro with this technique is the data is quicker to pull out (no joining on a table) and it is also quicker for a human to read.

Rocklan
  • 7,888
  • 3
  • 34
  • 49
  • I was thinking to use a mySQL enum as suggested here, or a CHECK constraint in SQL. That would eliminate data errors (capitalization, spelling, etc). How do you think this method wastes storage and is slower? – user Dec 30 '11 at 20:20