15

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, retrieving the data using all possible unions, subqueries, joins etc.? Why can't we have all the data in a single table and retrieve it as required?

philipxy

5 Answers

17

The main reason is to eliminate repetition of data. For example, if you had a user with multiple addresses and you stored this information in a single table, the user information would be duplicated along with each address entry. Normalisation would separate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
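A minimal sketch of that decomposition using SQLite (table and column names are illustrative, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised schema: user data lives once; addresses reference it by key.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE addresses (
                   id INTEGER PRIMARY KEY,
                   user_id INTEGER REFERENCES users(id),
                   street TEXT)""")

cur.execute("INSERT INTO users VALUES (1, 'Alice')")
cur.executemany("INSERT INTO addresses VALUES (?, ?, ?)",
                [(1, 1, '1 Main St'), (2, 1, '2 Side Ave')])

# One user row serves both addresses; no duplicated user data.
rows = cur.execute("""SELECT u.name, a.street
                      FROM users u
                      JOIN addresses a ON a.user_id = u.id
                      ORDER BY a.id""").fetchall()
print(rows)  # [('Alice', '1 Main St'), ('Alice', '2 Side Ave')]
```

In the single-table version, Alice's name (and any other user columns) would be copied into every address row; here it is stored exactly once.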

Full normalisation will generally not improve performance; in fact, it can often make it worse, but it will keep your data duplicate-free. In some special cases I've even denormalised specific data in order to get a performance increase.

Steve Temple
    so if there is no duplication you should not normalize? – IAdapter Jul 09 '09 at 10:00
  • If you just had a list of users with only ever one address then there would be no need to break the data into multiple tables. I would still advise using a unique key for that data. – Steve Temple Jul 10 '09 at 01:18
  • Normalization involves "eliminating repetition of data", but that everyday phrase is unhelpfully vague, whereas normalization actually *defines* one specific thing reasonably called that. (Ie it enables updates via one shorter row instead of multiple longer rows.) (Ie it "eliminates update anomalies".) Your example does *not* need normalizing nor did you normalize it. Also it isn't a "repetition" that is necessarily bad. Also normalization does not introduce new columns, ie keys, as in your example. Eg your introduced id values "repeat" where they replace addresses exactly as the addresses did. – philipxy Aug 08 '15 at 00:13
  • Normalization decomposes to multiple smaller tables that join to an original. (And it doesn't add new columns.) Performance change depends on use. Eg it can give an enormous improvement in performance, in the ratio of the product of smaller table cardinalities to their sum. People always focus on the join cost when we need the original; but every time we use just some of the smaller we save. In fact whenever we start with multiple tables we have intuitively normalized compared to using fewer tables. (Ie removed a JD). – philipxy Aug 08 '15 at 00:26
10

Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.

Normalization in a database essentially means the same thing: if you change a piece of data, this is supposed to change just one single piece of information in the database. Imagine a database of e-mails: if you store the ID and the name of the recipient in the Mails table, but the Users table also associates the name with the ID, then if you change a user name, you don't only have to change it in the Users table, but also in every single message that this user is involved with. So the axis "message" and the axis "user" are not "perpendicular" or "normal."

If on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message, all user information is gathered from the Users table (by means of a join).
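That behaviour can be sketched with SQLite (schema and names are illustrative): the Mails table stores only the user ID, so renaming the user in one place is reflected in every message on retrieval.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE mails (
                   id INTEGER PRIMARY KEY,
                   recipient_id INTEGER REFERENCES users(id),
                   subject TEXT)""")
cur.execute("INSERT INTO users VALUES (1, 'Bob')")
cur.executemany("INSERT INTO mails VALUES (?, ?, ?)",
                [(1, 1, 'Hello'), (2, 1, 'Re: Hello')])

# Rename the user in exactly one place...
cur.execute("UPDATE users SET name = 'Robert' WHERE id = 1")

# ...and every message picks up the new name via the join.
rows = cur.execute("""SELECT m.subject, u.name
                      FROM mails m
                      JOIN users u ON u.id = m.recipient_id
                      ORDER BY m.id""").fetchall()
print(rows)  # [('Hello', 'Robert'), ('Re: Hello', 'Robert')]
```

Had the name been stored in each mail row, the UPDATE would have had to touch every message instead of one user row.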

balpha
6

Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.

First normal form can be summarised as:

  • no repeating groups in single tables.
  • separate tables for related information.
  • all items in a table related to the primary key.

Second normal form adds another restriction, basically that every column not part of a candidate key must be fully dependent on every candidate key, that is, it must not depend on just a part of any candidate key (a candidate key being defined as a minimal set of columns whose values cannot be duplicated in the table).

And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on the key, the whole key and nothing but the key, so help me Codd1.
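The 3NF rule above can be sketched with SQLite (table and column names are illustrative): a single employees table that also carried a department name would make the department name depend on the department ID, a non-key column; the 3NF decomposition moves it into a table where that ID *is* the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF decomposition: the department name lives in the table keyed by dept id,
# so every non-key column depends only on its table's candidate key.
cur.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE employees (
                   id INTEGER PRIMARY KEY,
                   name TEXT,
                   dept_id INTEGER REFERENCES departments(id))""")
cur.execute("INSERT INTO departments VALUES (10, 'Engineering')")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, 'Ann', 10), (2, 'Ben', 10)])

# The department name is stored once and recovered with a join.
rows = cur.execute("""SELECT e.name, d.name
                      FROM employees e
                      JOIN departments d ON d.id = e.dept_id
                      ORDER BY e.id""").fetchall()
print(rows)  # [('Ann', 'Engineering'), ('Ben', 'Engineering')]
```

In the unnormalised version, renaming the department would require updating every employee row; here it is a single-row update in departments.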

Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").

The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.

Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.

In fact, although all databases should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.

And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.


1 If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them, they're the fathers of relational database theory.

paxdiablo
  • @philipxy, not sure I understand your first point, I could quite easily add a column that has nothing to do with a key (for example, a totally random number) - that would surely violate 1nf, yes? As to the second point, I can't see 'primary' anywhere in my answer though the phrase 'the key' may mislead in this direction. If you have an idea on how to avoid that, let me know, otherwise I'll give it a shot when I get to a real computer. – paxdiablo Aug 08 '15 at 00:56
  • Re 1: No, it doesn't. Normalization is not about some vague everyday term like "has nothing to do with". A specific term with specific definition that *is* relevant is "functionally depends on". (I guess by "key" you mean CK, candidate key. A *definition* of CK is a column set whose subrows are unique and that contains no smaller column set whose subrows are unique.) Re "every column depends on every key" see my comments on other answers. – philipxy Aug 08 '15 at 01:35
  • Re 2: Yes, the only way to make sense of "the key" is as "the PK". (Which makes your bullets wrong.) Re "how to avoid that": It's going to involve using "some CK(s)" and/or "all CKs" as appropriate. But if you want to write a *correct* answer in your own words for [2NF](https://en.wikipedia.org/wiki/Second_normal_form) & [3NF](https://en.wikipedia.org/wiki/Third_normal_form) then you are just going to have to use language and/or math to clearly express the same content. What can I say? Good writing takes effort to compose & repeatedly redraft, focusing on the reader's perspective. – philipxy Aug 08 '15 at 01:51
  • @philipxy, okay, I've tried to make it a *little* more formal but I don't want to run the risk of making it so dry no-one will read or understand it, preferring to (as you say) "focus on the reader's perspective". They can probably go buy the Codd/Date books if they want the *real* meat, or they could buy mine, which I haven't really used since Uni :-) Thanks for your thoughts/assistance, by the way. – paxdiablo Aug 08 '15 at 01:53
  • @philipxy, okay, added "disclaimers" that the answer is necessarily simplified, targeting the level of the OP. Hopefully, that will satisfy both camps. – paxdiablo Aug 08 '15 at 02:03
  • Good, because the point is, either it is a correct definition and they can use it to normalize or it's not and they can't. In the latter case why imagine that you have given a definition or explained it? Ie they will be able to read and understand *your answer*, and so understand *something about normalization*, but they won't *understand how to do normalization*. But it is reasonable to say something like, "involves..." followed by something that is actually clear and true, though (explicitly) not sufficient. – philipxy Aug 08 '15 at 02:03
  • PS apropos of my preceding comment & my others: "every column not part of a CK must be dependent on every CK" is just not 2nf. Every attribute is always dependent on every CK. Correct is, must be not dependent on any smaller subset of any CK (ie not "partially" dependent, ie "fully" dependent--technical terms). Similarly re your 3nf characterization. If you can't see that these are wrong then try to derive them from the links above. The difficulty of finding clear simple correct phrasings of definitions or even truths re normalization is why people don't understand it. Good luck. – philipxy Aug 11 '15 at 19:45
  • @philipxy, I still think you're wrong on that second statement as I've pointed out previously. I reiterate: if I add a random value to a table, it is *not* dependent on any CK. Ditto if I do something ridiculous like add a social security number to a table of geological rock types. However, I'll consider your other statements, I'm hoping it's just that I just haven't expressed my intent correctly. – paxdiablo Aug 11 '15 at 23:53
4

We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion, and updates. Normalization doesn't necessarily increase performance.

There is much material on the internet, so I won't repeat the stuff here again. But you can have a look at Normalization rules and Anomalies (others as well).

Umair Ahmed
1

As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.

Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...

And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?

Normalization keeps you away from a lot of these problems...
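A sketch of that layout in SQLite (names are illustrative): cars and dogs each get their own table keyed back to the user, so "two cars and one dog" needs no duplicated or mixed-up user rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE cars (
                   id INTEGER PRIMARY KEY,
                   owner_id INTEGER REFERENCES users(id),
                   model TEXT)""")
cur.execute("""CREATE TABLE dogs (
                   id INTEGER PRIMARY KEY,
                   owner_id INTEGER REFERENCES users(id),
                   name TEXT)""")

cur.execute("INSERT INTO users VALUES (1, 'Carol')")
cur.executemany("INSERT INTO cars VALUES (?, ?, ?)",
                [(1, 1, 'Civic'), (2, 1, 'Golf')])
cur.execute("INSERT INTO dogs VALUES (1, 1, 'Rex')")

# One user row; any number of cars and dogs, each in its own table.
cars = cur.execute(
    "SELECT model FROM cars WHERE owner_id = 1 ORDER BY id").fetchall()
dogs = cur.execute(
    "SELECT name FROM dogs WHERE owner_id = 1 ORDER BY id").fetchall()
print(len(cars), len(dogs))  # 2 1
```

Adding a third car or a second dog is just another row in the relevant table; the users table, and any custom uniqueness logic, is untouched.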

Paddy
  • There's nothing wrong per se with having a row for each address. It's only when that is present *when certain other things are* that normalizing is apropos... eg when we have multiple addresses *and* multiple dogs *and* it's not one dog per house or house per dog. So this really doesn't explain anomalies. (See my comments on SteveTemple's answer.) – philipxy Aug 08 '15 at 00:39