Now, as I understand it, the general
principle behind data normalization is
to create a RDBMS where data
redundancy is kept to a minimum.
Ummm, ok.
In my project, one of the DB people
created a DB. We have 50+ tables, and
the tables in the DB are usually very
fragmented, ie. a table has two or
three columns and that's it.
The number of tables doesn't say anything about whether the design is good or bad. Some businesses need one or two. Others need more. I've worked on databases at Fortune 500s that had thousands of tables.
The number of columns doesn't say anything about whether the design is good or bad. And the number of columns has nothing to do with fragmentation. I will say that tables that have relatively few columns is generally a good sign. Not always a good sign, but generally a good sign.
Now, when it comes to writing sql
queries, it has become something of a
minor hassle since each query involves
combing through several different
tables and joining them together. I
was wondering if this is a a side
effect of data normalization? Or does
this point to something else?
There are two different, common reasons for that.
When you normalize a table, you reduce redundancy (and increase data integrity) by identifying functional dependencies, isolating the functionally dependent columns in one or more new tables, and removing them from the original table. So normalizing a table, in the sense of moving from a lower normal form to a higher normal form
- always increases the number of
tables,
- always reduces the number of columns
in the original table, and
- sometimes requires a join to retrieve
data for humans.
Another common practice is to replace strings with id numbers. This has nothing to do with normalization. (There's no such thing as "id number normal form".) Replacing strings with id numbers
- always increases the number of
tables,
- doesn't change the number of columns
in the original table (unless done at
the same time as normalization),
- always requires a join to retrieve data for humans.
There seems to be some confusion in other parts of this thread. I realize that, strictly speaking, none of the following is directly related to the OP's question.
1NF is the "one value" principle. It doesn't have anything to do with a row being "atomic". In the relational model, atomic doesn't refer to rows; it refers to values.
"One value" means that each intersection of a row and a column contains a single value. (In other words, the value is "atomic". But the word atomic has some unfortunate connotations, so most modern practitioners avoid it.) That value doesn't need to be simple; it can be arbitrarily complex. But if it has parts that themselves have meaning, the dbms either completely ignores those parts, or it provides functions to manipulate them. (You don't have to write functions to manipulate the parts.)
I think the simplest example is a date. Dates have parts, consisting of a year, month, and day. The dbms either ignores those parts (as in SELECT CURRENT_DATE
) or it provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)
).
Attempts to dodge the "one value" principle lead to a corollary: the "no repeating groups" principle.
A repeating group involves multiple values from one domain, all values having the same meaning. So a table like the following one is an example of one kind of repeating group. (There are other kinds.) Values for both "phone_1" and "phone_2" come from the same domain, and they have the same meaning--user 'n' has phone numbers (phone_1 and phone_2). (Primary key is "user_id".)
user_id phone_1 phone_2
1 (111) 222-3333 (111) 222-3334
2 (111) 222-3335 (111) 222-3336
But the next table, although it's very similar, doesn't have a repeating group. The values come from the same domain, but they don't have the same meaning. (Primary key is "user_id".)
user_id home_phone work_phone
3 (111) 222-3333 (111) 222-3334
4 (111) 222-3335 (111) 222-3336
2NF is the "whole key" principle. It doesn't have anything to do with the number of keys; a table having 'n' columns could have 'n' keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you're doing normalization exercises), if you see the word key by itself, think "candidate key".
Instead, 2NF has to do with candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that every non-prime attribute be functionally dependent on all the columns of every candidate key, not on just some of the columns of any candidate key. (A non-prime attribute is an attribute that's not part of any candidate key.)
The following example is adapted from the Wikipedia entry on 2nf. (Primary key is {employee, skill}.)
Table: employee_skills
employee skill current_work_location
--
Jones Typing 114 Main Street
Jones Shorthand 114 Main Street
Jones Whittling 114 Main Street
Bravo Light Cleaning 73 Industrial Way
Ellis Alchemy 73 Industrial Way
Ellis Flying 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way
While it's true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it's also functionally dependent on just part of the primary key, "employee". That table isn't in 2NF.
You can't dodge a 2NF problem by assigning each row a surrogate key. (Primary key is es_id; there's a UNIQUE constraint on the former primary key, {employee, skill}).
Table: employee_skills
es_id employee skill current_work_location
--
1 Jones Typing 114 Main Street
2 Jones Shorthand 114 Main Street
3 Jones Whittling 114 Main Street
4 Bravo Light Cleaning 73 Industrial Way
5 Ellis Alchemy 73 Industrial Way
6 Ellis Flying 73 Industrial Way
7 Harrison Light Cleaning 73 Industrial Way
It should be obvious that adding the id number did nothing to remove the partial dependency employee->current_work_location
. Without removing the partial dependency, this table is still not in 2NF.
3NF is the "no transitive dependencies" principle. It doesn't necessarily have anything to do with derived or calculated data, as you can tell from the Wikipedia example, adapted here. (Primary key is {tournament, year}. This table is not in 3NF.)
Table: tournament_winners
tournament year winner winner_date_of_birth
--
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977
Two dependencies show that this table has a transitive dependency.
- The values in winner_date_of_birth
appear to be functionally dependent on the
primary key. Each primary key value
determines one and only one value
for winner_date_of_birth. But . . .
- The values in winner_date_of_birth
also appear to be functionally dependent on
winner. Each value for winner
determines one and only one value
for winner_date_of_birth.
Given those two apparent functional dependencies and an understanding of what the words tournament, winner, and date of birth mean, we can say that
- winner -> winner_date_of_birth is a
functional dependency, and
- {tournament, year} -> winner is a functional dependency, and
- {tournament, year} ->
winner_date_of_birth is a transitive
dependency.