When does repeating data get removed: before/after normalization?

Question

Given this logical design:

R(a,b, c, d)
a is the only key. I can't underline it using this editor. 
a->b
a->c
a->d

It's in BCNF because there are no composite keys, no transitive dependencies, and no partial key dependencies.

However, we still have repeating data across rows in the attributes b, c, and d.

Do we introduce surrogate keys and rewrite it this way:

R(a, bID, cID, dID)
R1(bID, b)
R2(cID, c)
R3(dID, d)

if so, does that happen before or after normalization?

Surrogate keys might be compact but they do nothing to reduce duplication of values, and they add a level of indirection. Why do you view multiple appearances of the same value as a problem? — reaanb, Apr 04 '17 at 16:30
I see them as a problem for the reasons you explained. I do understand that those particular duplications are not an issue in the normalization process. — nicomp, Apr 04 '17 at 19:04
Whether you introduce surrogate keys before or after normalization makes no difference. — reaanb, Apr 04 '17 at 19:32
1. What is "repeating data across rows"? 2. Normalization does not add ids, it replaces a table by projections of it. 3. Your "because" is unclear & confused, see some BCNF & lower NF definitions. 4. Assuming no non-FD JDs, this is in 5NF, as normalized (in the normal sense) as it can get. What do you mean, "before or after normalization"? 6. Adding ids leaves id values with the exact pattern of appearances as the values they identify, so adding them does not affect what "repeating data across rows" you have (whatever you mean by that). PS http://stackoverflow.com/a/26952821/3404097 — philipxy, Apr 08 '17 at 00:19
Your question is about "repeating data across rows in the attributes" but you have not explained what this is so how can we answer your question? You seem to think it's obvious, but instead you seem to hold numerous misconceptions, not only about "repeating data" but normalization and surrogates/identifiers. So if you want to not suffer from misconceptions please edit your question to be clear about your assumptions/expectations and just what is bothering you and why you think it's a problem. — philipxy, Apr 09 '17 at 22:05
If it's in BCNF, then you don't have the problem of "repeating data across rows in the attributes b, c, and d." — Mike Sherrill 'Cat Recall', Apr 11 '17 at 02:04

Joel Brown · Answer 1 · 2017-04-05T12:05:51.190

4

The point of normalization is not to remove repetition. It is to remove inappropriate dependencies. If every non-key attribute is fully functionally dependent on the primary key (and nothing else) then it doesn't matter for purposes of normalization that from one row to another in a table that some column data may be the same. That sameness is incidental.

Here is the thing you have to think about when looking at repetition and deciding whether it is incidental or meaningful. Consider the case of an update to a non-key column.

In one scenario, let's say that the non-key column is a person's name. What happens in your system when someone changes their name? If the old value is "Doug" and the new value is "Bob" do you want every instance of "Doug" to be replaced by "Bob"? Maybe you do, but I'm guessing you probably don't. If you were to create surrogate keys and normalize out the non-key value to another table then you would be incorrectly changing values that you don't mean to change.

In another scenario, let's say the non-key column is a municipality name. What happens in your system when you change a municipality name? Let's say the old value is "New Berlin" and the new value is "Kitchener". Would you want every instance of "New Berlin" to become "Kitchener"? Maybe so. (perhaps not, it depends on your business rules) = If you do want to change every instance then what you've discovered is that the municipality name may not be fully functionally dependent on your primary key. In that case you should normalize it out to a new table.

You have asked when this should happen (before or after normalization). The answer is that it happens as part of the normalization. The act of moving data off into a separate relation in order to avoid a partial or transitive functional dependency is itself the act of normalizing your database schema. Is this part of 2NF or 3NF? It depends. If your non-key attribute is partially functionally dependent on the key then it's during 2NF. If it's transitively dependent (i.e. dependent on another non-key attribute or attributes) then it's during 3NF.

You should perform normalization as part of the logical modeling process as much as possible. When you get to the physical model you are more likely to introduce denormalization for one or another of some practical reasons. Denormalization (in transaction processing systems) is something you should generally do only when you find that you have to. 3NF or higher is a good stake in the ground for OLTP systems. Therefore, you will have built your logical and your physical schemas before you start denormalizing in most cases.

edited Apr 05 '17 at 12:05

answered Apr 04 '17 at 10:27

Joel Brown

14,123
4
52
64

I agree, the question is not properly worded and I will rewrite it. Can you explain how you might handle the problem I described, independent of the normalization process? Do I create the ancillary tables and surrogate keys after normalization? Do I ignore this issue in the logical design phase and implement it in the physical design? – nicomp Apr 04 '17 at 11:32
" The answer is that it happens as part of the normalization. " -- As part of what level of normalization? – nicomp Apr 04 '17 at 21:26
1

@nicomp & JoelBrown The desire to change multiple appearances of a value to be appearances of another value all at once has nothing to do with normalization. (It's a matter of complicating a schema to simplify some queries & updates and complicate others.) Also, normalization to a given NF does not in general involve moving through lower NFs. (Doing that can make good higher-NF designs unavailable.) – philipxy Apr 08 '17 at 00:02
@philipxy I disagree with your assertion that changing one or more than one appearance of a value at a time has nothing to do with normalization. Unless we are talking at cross purposes this type of potentially harmful change is precisely what an **update anomaly** is and is one of the main points of removing transitive functional dependencies. I do however agree with you that normalization is not a sequential exercise, but rather a holistic one. – Joel Brown Apr 08 '17 at 13:26
1

@nicomp & JoelBrown Just to be clear, I was referring to introducing an id so that you can change what the id maps to instead of changing the various separate occurrences. That is a design isssue, but not a normalization issue, because normalization does not introduce ids, it replaces a table by projections of it. – philipxy Apr 08 '17 at 18:33
@philipxy "... normalization to a given NF does not in general involve moving through lower NFs." -- I'm confused : the definition of 3NF is that the relation is already in 2NF. Same for 1NF to 2NF, 3NF to BCNF, etc. – nicomp Apr 09 '17 at 12:08
1

@nicomp The definition for 3NF means that "all of the conditions for 2NF are also true, plus..." It is possible and even common to have a relation recorded out of the blue which is already in 3NF without any manipulation. Normal forms describe a state of being, not a process. – Joel Brown Apr 09 '17 at 17:20
@JoelBrown Normal forms may be a state of being, but normalization is a process. – nicomp Apr 10 '17 at 18:31
@nicomp Not withstanding the difference between nouns and verbs, what is carefully and precisely defined in the literature are the normal forms. There is no standard, agreed, inviolable process called "normalization". The point is the end state, not how you get there. – Joel Brown Apr 11 '17 at 11:13
@JoelBrown Normalization can be automated, therefore there is a standard, agreed, inviolable process. – nicomp Apr 11 '17 at 13:05
1

Normalization can indeed be automated but there are many different algorithms for that (Bernstein's and Halpin's to name just two). They all work differently and can give different results. Normalization is a set of design patterns and principles based on science, not religion. There is no single universal ritual for achieving a normalized database design – nvogel Apr 11 '17 at 16:14
1

@nicomp That confuses clauses of a definition with steps in an algorithm. PS No transitive FDs of non-primes on CKs implies no partial dependencies of non-primes on CKs. The only reason the clause "and is in 2NF" was used by Codd for his 3NF was because he wanted it to require his [1NF](https://stackoverflow.com/a/40640962/3404097). One could just say "and is in 1NF" & his 3NF still implies 2NF. Taking "in 1NF" to mean the trivial "is a relation" we get all normalization theory results re NFs that don't care about attribute types. That (modern) sense of 3NF needn't mention 2NF or 1NF. – philipxy Sep 14 '18 at 00:21

When does repeating data get removed: before/after normalization?

1 Answers1