8

Consider the following relation to illustrate my question:

Person( name, street, city, zipcode )

name -> street , city , zipcode
street + city -> zipcode

So if we know the name, we also know where the person lives. But zipcode is also (transient) dependent on street + city. Thus, this relation breaks 3NF and should be split up into two tables to conform.

But in this case, we are not interested in zipcodes as a separate entity. It is part of the address, and just happens to be a transient dependent. We will never use it separately.

I understand why normalization is a good thing. But is it really necessary to always normalize (and thus make the database more complex)? If not, how do you know when you can skip it?

(if my terminology or notation is wrong, you are welcome to correct me)

Vikdor
  • 23,934
  • 10
  • 61
  • 84
rlv-dan
  • 976
  • 1
  • 10
  • 21
  • 2
    As with any "rule" in programming, you may break it if you can justify breaking it. You typically denormalize for performance. – Bohemian Sep 23 '12 at 13:45
  • 1
    There are two reasons you might consider denormalizing here. The first is that wider tuples offer richer constraint possibilities. Thus you can define wider tables in order to define constraints you can't define otherwise. The other is for performance. The former is more common. I have been known to advocate breaking 1NF where constraints require it. – Chris Travers Sep 24 '12 at 10:04

5 Answers5

9

In addition to performance, one more reason for not fully normalizing might be if you have a certain "fuzziness" in your data.

As far as I understand1, ZIP may be specific to a city block or area, which means an especially long street could have more than one ZIP. And even if ZIP did correspond to city+street in US, that might not be true for postal codes in other countries, should you ever decide to go international.

But even assuming ZIPs really are city+street specific, human beings likely enter the address information themselves, which means they can make mistakes including incorrect ZIP. So you could end-up with two ZIPs for the same combination of city and street.

A fully normalized database simply has no way of representing that - you'd have to pick one of the ZIPs somehow. Unless you have an access to a full, up-to-date database of all ZIPs, you have no good way of resolving this conflict. If you end-up picking the wrong ZIP, all persons on the same city+street will have the wrong ZIP.

On the other hand, de-normalized database will let each person keep their own ZIP and later suffer the consequences in isolation from other persons. You could even implement an auto-complete suggestion and "are you sure?" warning in case user enters a different ZIP for the existing city+street that already has a ZIP, but then let him (or her) proceed if he indicated he is sure.


1 And I don't live in US, so I might be off.

Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
7

Normalization is a tool for analysing dependencies and ensuring the correct implementation of data integrity rules (business rules) represented as dependencies. An underlying assumption of normalization is that you know or can determine which business rules you actually want to implement. If you are already certain you don't want or need to enforce a given business rule then there is probably little value in considering it as a dependency when designing the database for it. Remember that the point of dependencies is that a rule is in force at all times for all possible data in the database; not just for current data or some particular subset of data.

It may be the case that the dependency {street,city} -> {zipcode} is not truly the desired business rule for the system and therefore should not be enforced. E.g. if data has to be entered without address verification software it may be impractical to ensure that zipcodes are consistent in that way. That doesn't mean you are violating any normalization rule. It just means the functional dependency isn't intended to hold and does not hold and therefore it isn't a transitive dependency in any real sense.

nvogel
  • 24,981
  • 1
  • 44
  • 82
3

The value and the cost of pushing normalization all the way depends. It depends mostly on what you will be doing with the data.

There are (at least) two radically different ways of using the data. One is On Line Transaction Processing (OLTP). The other is On Line Analytical Processing (OLAP).

In OLTP, the cost of not normalizing can be quite high. Transactions get to be more complex and slower, and bottlenecks degrade performance. In OLAP, the benefits of normalizing are limited, and there are other design disciplines that can yield more benefits for the same effort. One of those alternatives is star schema design, which you can look up.

But it isn't so much a matter of NOT normalizing, or of DEnormalizing, but of following a different design discipline, even if it doesn't result in a normalized database.

Getting back to the speciifc case you outlined, there are lots of systems where there is a heavy transaction load on customer activity, but the customer table is used for read only purposes in those transactions.

Failure to conform to 3NF is only going to hurt you when you have to enter a new customer, and you have to enter the zip code all over again, when there are already other customers with the same city, street, and zip code. And in the event that the post office changes the zip code assignment of a given street, you'll have to update lots of addresses instead of just one row in a normalized table.

That's not a very high cost, and not a very likely event.

On the other hand, how likely is it that the Post Office will take a single street, and split that street between two zip codes, depending on which block in the street the address is on? If this latter event happens, you're actually better off with the structure that violates 3NF. You are free to enter different zip codes for each address, using the information the Post Office gave about the split.

So, how likely is this second scenario? I think it's more likely than the first. But you need to go with your guess, and not mine.

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58
2

I'm not American so I am hesitant to say this, but I don't think you understand zipcodes. Some individual buildings have a zipcode of their own. Zipcodes can cross state boundaries. A zipcode can represent a PO Box with any geographical significance whatsoever.

So, regqrdless of the benefits of normalisation, your example is a bad one to pick. There is no clear correlation between (street,city) and zipcode.

It's possible that that I have this wrong, but I know in the UK streets (even quite short streets) can have more than one postcode.

APC
  • 144,005
  • 19
  • 170
  • 281
  • You're right about ZIP codes. You can answer the OP's question as it relates to functional dependencies even though the example isn't accurate. When I answered earlier, I debated whether to point that out, and decided not to. (This time.) – Mike Sherrill 'Cat Recall' Sep 24 '12 at 00:13
  • @Catcall - I, on the other hand, decided that enough people had already debated the generic normalisation aspects of the question. I think the accuracy of the example is relevant, because a wrongly-denormalised data model is even worse than an insufficiently normalised one: at least we can rescue the latter. – APC Sep 24 '12 at 12:46
  • @Catcall&APC. There are two ways to go wrong here. One is to fail to use functional dependencies that are really inherent in the data. The other is to rely on mythological functional dependencies that are not really inherent in the data. This seems to be a case of the latter. – Walter Mitty Sep 25 '12 at 12:09
0

If {street, city}->{zipcode}, then that constraint needs to be made known to the dbms so the dbms can enforce it. Otherwise, you soon end up with data that looks like this.

name           street              city              zipcode
--
Barack Obama   Pennsylvania Ave    Washington, DC    90210

90210 is a ZIP code, but it's for Beverly Hills, California.

It's a rare application that can truly tolerate bad data like that.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • To enforce the rule we would require a complete set zipcodes available to be able to enforce, which we might not have. I'm thinking that we know from government information that the rule applies, but it's not *our* rule? – rlv-dan Sep 23 '12 at 14:42
  • No, it doesn't require a complete set of ZIP codes. It requires that you have a procedure for adding new rows (new ZIP codes) at runtime. Almost all tables require rows to be added at runtime. – Mike Sherrill 'Cat Recall' Sep 23 '12 at 15:04