Data normalization and writing queries

Question

I'm a jr. developer (5 months on the job), and I have a question about data normalization. Now, as I understand it, the general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum. In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, ie. a table has two or three columns and that's it. Now, when it comes to writing sql queries, it has become something of a minor hassle since each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization? Or does this point to something else?

I know that the easiest thing to do, for me, would be to write tables based off the queries I have to write. This will create a DB with a lot of redundant data, but I was curious if there is a happy medium?

Just as a postscript, I don't want to come across like I'm whining about my work, but I'm genuinely curious to learn more about this. My work environment is not the most friendly so I don't feel comfortable posing this question with my colleagues. However, I would appreciate any thoughts, books, tutorials or opinions from more experienced people.

Thanks.

Exactly, as you say it. It's a side effect of data normalization. Queries are usually harder to write but you have HUGE gains in minimum redundancy, reduced data modofication anomalies, faster searches, faster sorting and many more. — ypercubeᵀᴹ, Jun 22 '11 at 00:47
Hard to answer such a general question without seeing why the data is so fragmented. Assuming the data isn't badly normalized, typically you create the `TABLE`s in a de-normalized state and then use `VIEW`s to save common queries for reuse. — Bacon Bits, Jun 22 '11 at 00:48
See this question: http://stackoverflow.com/questions/1724381/explaining-why-just-add-another-column-to-the-db-is-a-bad-idea-to-non-programm — ypercubeᵀᴹ, Jun 22 '11 at 00:49
And this: http://stackoverflow.com/questions/246701/what-is-normalisation-or-normalization-why-is-it-important — ypercubeᵀᴹ, Jun 22 '11 at 00:50

S.Lott · Accepted Answer · 2011-06-29T10:05:14.830

4

general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Only partly true.

Normalization is not about "redundancy".

It's about "update anomalies".

1NF is the "don't use arrays" rules. Breaking 1NF means a row isn't atomic, but a collection and independent updates in the collection wouldn't work out well. There'd be locking and slowness.

2NF is the "one key" rule. Each row has exactly one key and everything in the row depends on the key. There are no dependencies on part of the key. Some folks like to talk about candidate keys and natural keys and foreign keys; they may exist or they may not. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is violated, you've got columns which depend on part of a key, but not the whole key. If you had a table with (Part Number, Revision Number) as a key, and attributes of color and weight, where weight depends on the whole key, but color only depends on the part number. You have a 2NF problem where you could update some part colors but not others, creating data anomalies.

3NF is the "only the key" rule. If you put derived data in a row, and change the derived result, it doesn't match the source columns. If you change a source column without updating the derived value, you have a problem, too. Yes, triggers are a bad hackaround to allow 3NF design violations. That's not the point. The point is merely to define 3NF and show that it prevents an update problem.

each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization?

It is.

edited Jun 29 '11 at 10:05

answered Jun 22 '11 at 00:52

S.Lott

384,516
81
508
779

+1 but really the only normal forms worthy of mention are BCNF (what 3NF should have been), 5NF ('fully normalized' and always achievable) and 6NF (always achievable but not always desirable). – onedaywhen Jun 22 '11 at 07:42
1

You said "Each row has exactly one key". That's simply wrong. None of the normal forms require a relation to have only one key, and 2NF doesn't require counting candidate keys. 2NF has to do with whether a non-prime attribute is functionally dependent on the whole of every key (a Good Thing), or only on part of any key (a Bad Thing, requiring further normalization). – Mike Sherrill 'Cat Recall' Jun 24 '11 at 22:39
@Catcall: Each row has exactly one key. Why does the word "primary" always have to appear? Any "foreign keys" are just data elements which depends on the key. – S.Lott Jun 27 '11 at 10:14
1

@S.Lott: When you say, "Each row has exactly one key", you're mistaken. See, for example, Date's *An Introduction to Database Systems*, where you might find the quote "We have seen that it is possible for a given relvar to have more than one candidate key." (7th ed., p 260) – Mike Sherrill 'Cat Recall' Jun 27 '11 at 12:18
@Catcall: candidate keys and natural keys are not *always* primary keys. Often primary keys are surrogate keys. – S.Lott Jun 27 '11 at 12:24
1

I never said candidate keys or natural keys were always the primary key. I said that your claim that "Each row has exactly one key" is wrong. A table that has 'n' columns can have 'n' candidate keys. The relational model doesn't require any of them to be foreign keys, and it also doesn't provide any logical basis for designating one candidate key as primary, and all other candidate keys as secondary. (There might well be good practical reasons to do so, but those reasons are outside the relational model.) – Mike Sherrill 'Cat Recall' Jun 27 '11 at 13:34
@Catcall: I'm trying to avoid the subtlety of candidate keys and natural keys in the answer. It's too subtle -- as you correctly note. Key here means "Key". If "Primary Key" makes you happier, then substitute that phrase for "Key". Also, please avoid simply stating that I'm "mistaken". That doesn't help me understand what phrase would make you happy. Please provide the exact words that would make you happy. – S.Lott Jun 27 '11 at 14:16
I'm not very strong in theoretical terms, but [according to wikipedia](http://en.wikipedia.org/wiki/Second_normal_form) @Catcall is right. Personally I read your answer as "Each row has exactly one **unique** key", which is not what 2NF is about. If you meant **primary** key, it's pretty strange (though true) statement and still it's not what 2NF is about. If you use other definition of 2NF, please point to it – Alexander Malakhov Jun 29 '11 at 08:03
Probably I've put my words wrong way. What I don't like in your answer is that phrase "Each row has **exactly one** key..." is in direct contradiction with wikipedia's "1NF table is in 2NF iff, given **any** candidate key K...". Just these 2 words – Alexander Malakhov Jun 30 '11 at 02:25
@Alexander Malakhov: As I said to @Catcall; I'm specifically and intentionally ignoring "candidate keys" and foreign keys. How many times do I need to repeat the word Primary to satisfy you? Please provide a suggested rewrite. – S.Lott Jun 30 '11 at 09:41
1

Please don't be offended, I'm not trying to be picky or anything. But I honestly believe 2NF part of your answer is incorrect. If you meant "Primary key", then `Each row has exactly one primary key` is weird and makes no sense in 2 ways: **1** (*exactly one*) by very definition of primary key there can be only one **disregarding** any NF **2** (*Each row*) there is absolutely no way that some rows have primary key and others don't. Do you agree (I'll address other points later) ? – Alexander Malakhov Jul 01 '11 at 02:42
@Alexander Malakhov: 1 ("exactly one") A poor design could attempt to have two primary keys per row by doing a join between two tables. 2 ("each row") A poor design could attempt to have a nullable or optional primary key. The 2NF essentially provides a definition for "Primary key". That's why they don't use the word "Primary" when describing 2NF. 2NF defines what it means to be primary. One and only One. – S.Lott Jul 01 '11 at 11:33
-1 for factually incorrect answer. Sorry, but I just don't know how else can I point out that you contradict to the very definitions of terms. And it's already the longest discussion I've ever taken on SO. No offense – Alexander Malakhov Jul 20 '11 at 09:28
@Alexander Malakhov: "factually incorrect answer"? Is this still the "exactly one" part of 2NF? Or is there some other words or phrase you don't like? – S.Lott Jul 20 '11 at 09:57
Yeah, primarly "exactly one". What I expected from you, was that you will comment somehow on wikipeia's article. Because it doesn't mention, nor imply the one PK (or candidate or unique key) – Alexander Malakhov Jul 20 '11 at 14:45
@Alexander Malakhov: Wikipedia says "In addition to the primary key, the table may contain other candidate keys". I've specifically disregarded candidate keys (because I encourage use of surrogate keys). "*the* primary key" sure sounds like "exactly one" to me. – S.Lott Jul 20 '11 at 15:06
and this sentance ends "...;it is necessary to establish that no non-prime attributes have part-key dependencies on **any** of these candidate keys" (font weight kept from wikipedia). And below follows example to demonstrate that table can have 2 candidate keys and only one PK, but still it will not conform 2NF. No one ever argued there could be more than one, it's [imposible by definition](http://en.wikipedia.org/wiki/Primary_key). The whole 2NF is about candidate keys, and what should be done if there are more than one. – Alexander Malakhov Jul 20 '11 at 15:30
Specifically regarding I said "... it[wikipedia] doesn't mention, nor imply the one PK": the definition of PK clearly states that there can be only 1, and of course any article mentioning PK will imply there can be only one. But what I had in mind was that definition of 2NF doesn't imply anything about PKs – Alexander Malakhov Jul 20 '11 at 15:52
-1 for the "one key" part of the answer. Whether you include a surrogate or not is irrelevant. 2NF is violated if a nonprime attribute depends on only part of *any* key - adding a surrogate doesn't change that. 2NF could more accurately be described as the "All of every key rule". – nvogel Feb 28 '12 at 05:50

score 3 · Answer 2 · edited May 23 '17 at 12:07

Now, as I understand it, the general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Ummm, ok.

In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, ie. a table has two or three columns and that's it.

The number of tables doesn't say anything about whether the design is good or bad. Some businesses need one or two. Others need more. I've worked on databases at Fortune 500s that had thousands of tables.

The number of columns doesn't say anything about whether the design is good or bad. And the number of columns has nothing to do with fragmentation. I will say that tables that have relatively few columns is generally a good sign. Not always a good sign, but generally a good sign.

Now, when it comes to writing sql queries, it has become something of a minor hassle since each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization? Or does this point to something else?

There are two different, common reasons for that.

When you normalize a table, you reduce redundancy (and increase data integrity) by identifying functional dependencies, isolating the functionally dependent columns in one or more new tables, and removing them from the original table. So normalizing a table, in the sense of moving from a lower normal form to a higher normal form

always increases the number of tables,
always reduces the number of columns in the original table, and
sometimes requires a join to retrieve data for humans.

Another common practice is to replace strings with id numbers. This has nothing to do with normalization. (There's no such thing as "id number normal form".) Replacing strings with id numbers

always increases the number of tables,
doesn't change the number of columns in the original table (unless done at the same time as normalization),
always requires a join to retrieve data for humans.

There seems to be some confusion in other parts of this thread. I realize that, strictly speaking, none of the following is directly related to the OP's question.

1NF is the "one value" principle. It doesn't have anything to do with a row being "atomic". In the relational model, atomic doesn't refer to rows; it refers to values.

"One value" means that each intersection of a row and a column contains a single value. (In other words, the value is "atomic". But the word atomic has some unfortunate connotations, so most modern practitioners avoid it.) That value doesn't need to be simple; it can be arbitrarily complex. But if it has parts that themselves have meaning, the dbms either completely ignores those parts, or it provides functions to manipulate them. (You don't have to write functions to manipulate the parts.)

I think the simplest example is a date. Dates have parts, consisting of a year, month, and day. The dbms either ignores those parts (as in SELECT CURRENT_DATE) or it provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)).

Attempts to dodge the "one value" principle lead to a corollary: the "no repeating groups" principle.

A repeating group involves multiple values from one domain, all values having the same meaning. So a table like the following one is an example of one kind of repeating group. (There are other kinds.) Values for both "phone_1" and "phone_2" come from the same domain, and they have the same meaning--user 'n' has phone numbers (phone_1 and phone_2). (Primary key is "user_id".)

user_id    phone_1           phone_2    
1          (111) 222-3333    (111) 222-3334
2          (111) 222-3335    (111) 222-3336

But the next table, although it's very similar, doesn't have a repeating group. The values come from the same domain, but they don't have the same meaning. (Primary key is "user_id".)

user_id    home_phone        work_phone    
3          (111) 222-3333    (111) 222-3334
4          (111) 222-3335    (111) 222-3336

2NF is the "whole key" principle. It doesn't have anything to do with the number of keys; a table having 'n' columns could have 'n' keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you're doing normalization exercises), if you see the word key by itself, think "candidate key".

Instead, 2NF has to do with candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that every non-prime attribute be functionally dependent on all the columns of every candidate key, not on just some of the columns of any candidate key. (A non-prime attribute is an attribute that's not part of any candidate key.)

The following example is adapted from the Wikipedia entry on 2nf. (Primary key is {employee, skill}.)

Table: employee_skills
employee        skill            current_work_location
--
Jones           Typing           114 Main Street
Jones           Shorthand        114 Main Street
Jones           Whittling        114 Main Street
Bravo           Light Cleaning   73 Industrial Way
Ellis           Alchemy          73 Industrial Way
Ellis           Flying           73 Industrial Way
Harrison        Light Cleaning   73 Industrial Way

While it's true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it's also functionally dependent on just part of the primary key, "employee". That table isn't in 2NF.

You can't dodge a 2NF problem by assigning each row a surrogate key. (Primary key is es_id; there's a UNIQUE constraint on the former primary key, {employee, skill}).

Table: employee_skills
es_id   employee        skill            current_work_location
--
1       Jones           Typing           114 Main Street
2       Jones           Shorthand        114 Main Street
3       Jones           Whittling        114 Main Street
4       Bravo           Light Cleaning   73 Industrial Way
5       Ellis           Alchemy          73 Industrial Way
6       Ellis           Flying           73 Industrial Way
7       Harrison        Light Cleaning   73 Industrial Way

It should be obvious that adding the id number did nothing to remove the partial dependency employee->current_work_location. Without removing the partial dependency, this table is still not in 2NF.

3NF is the "no transitive dependencies" principle. It doesn't necessarily have anything to do with derived or calculated data, as you can tell from the Wikipedia example, adapted here. (Primary key is {tournament, year}. This table is not in 3NF.)

Table: tournament_winners
tournament             year  winner            winner_date_of_birth
--
Indiana Invitational   1998  Al Fredrickson    21 July 1975
Cleveland Open         1999  Bob Albertson     28 September 1968
Des Moines Masters     1999  Al Fredrickson    21 July 1975
Indiana Invitational   1999  Chip Masterson    14 March 1977

Two dependencies show that this table has a transitive dependency.

The values in winner_date_of_birth appear to be functionally dependent on the primary key. Each primary key value determines one and only one value for winner_date_of_birth. But . . .
The values in winner_date_of_birth also appear to be functionally dependent on winner. Each value for winner determines one and only one value for winner_date_of_birth.

Given those two apparent functional dependencies and an understanding of what the words tournament, winner, and date of birth mean, we can say that

winner -> winner_date_of_birth is a functional dependency, and
{tournament, year} -> winner is a functional dependency, and
{tournament, year} -> winner_date_of_birth is a transitive dependency.

It's kind of sad you've taken less upvotes then accepted answer. Seems like people vote for the first not stupid answer they see. — Alexander Malakhov, Jul 20 '11 at 09:33
@Alexander Malakhov: It might change over time. I get upvotes once in a while for stuff I wrote months ago. And sometimes, an OP changes the accepted answer. If you think my answer is useful for other questions on normalization, you can link to it instead of to the accepted answer. FWIW, I think normalization is more misunderstood than any other facet of database or programming work. — Mike Sherrill 'Cat Recall', Jul 21 '11 at 00:00

score 2 · Answer 3 · answered Jun 22 '11 at 03:09

Database views are a critically important tool in this dilemma. This excellent introduction says:

Here's the good news: you don't have to work with the normalized tables! ... It is very easy (at least for DBAs) to create an abstraction layer of joined views on top of the normalized data tables, putting the base tables completely "behind the scenes", and out of sight.

score 1 · Answer 4 · answered Jun 22 '11 at 00:50

It sounds like data normalization, but I would have to know more about the schema, the business case, etc to make that call reliably. If you had control of the database, you could write a view that represents common queries that link tables. In order to increase performance, you could create an indexed or materialized view (the name depends on the database platform, in this case, Oracle vs. Sql Server).

Just about any database primer will help you along with these concepts. If you are using Sql Server and are really interested in learning more, SQL Server Books Online is an excellent resource.

score 0 · Answer 5 · answered Jun 22 '11 at 00:48

Having large numbers of tables is definitely a symptom of a well normalized database design. This can be a pain when writing queries, but it is much better then getting data out of sync.

I sometimes write reports that run off of a database with thousands of tables. Each night, we have a program that runs and dumps data from the production tables into a data warehouse so that we can report against it more easily. The data warehouse tables are much less normalized, and this makes writing queries much simpler. You may want to consider something like this if it makes sense in your situation.

score 0 · Answer 6 · answered Jun 22 '11 at 00:51

0

Without seeing the data, it is hard to say if your data is over-normalized (or just not normalized correctly - spreading fields over several tables doesn't mean it's normalized) Generally speaking though, you will probably have to join several tables to see useful data in a well normalized database.

You can create views that join tables together, then you can query the view. This will probably help when selecting data.

answered Jun 22 '11 at 00:51

jlnorsworthy

3,914
28
34

@onedaywhen How about instead of a snarky comment, you add some value. You have never heard of the idea of over normalization, or you don't think it is possible? In your own comment you mention 6NF as always achievable but not always desirable. If you achieved 6NF when it wasn't desirable, wouldn't that then be considered over normalization? – jlnorsworthy Jun 22 '11 at 14:03
Sorry I didn't elaborate. Phrases such as "well normalized" and "over normalization" are informal terms that aren't useful and have virtually no meaning. Also, your answer doesn't seem to me to address the main point, whereas the first three sentences of @S.Lott's answer does (though the points about 1NF, 2NF, 3NF aren't really relevant), so they get upvoted and you get downvoted. I could post an example of when 6NF is achievable but not desirable but I don't think it would add value in this case and no, "over normalization" would not apply. – onedaywhen Jun 23 '11 at 06:12

score 0 · Answer 7 · answered Jun 22 '11 at 03:40

In a well designed database the joins you need in your queries should be pretty easy to code up. The downside is you have verbose SQL. The upsides are enormous:-

Consistent easy to update tables.
Quickly change according to business needs. Well designed DBs can usually handle queries that were not even thought of during the original design.
Quickly accommodate new entities. Its relatively easy to add new data entities and attributes to a well designed database. It can be a nightmare incorporating seemingly simple changes to a de-normalised database.

Data normalization and writing queries

7 Answers7

Linked