One way to express the 3NF is:
All attributes should depend on the key, whole key and nothing but the key.
The transitive dependency X->Y->Z violates that principle, leading to data redundancy and potential modification anomalies.
Let us break this down:
- By definition, for a functional dependency X->Y->Z to also be transitive, the X<-Y must not hold.
- If Y was a key, the X<-Y would hold, so Y cannot be a key. (FOOTNOTE1)
- Since Y is not a key, any given Y can be repeated in multiple rows.
- The Y->Z implies that all rows holding the same Y must also hold the same Z. (FOOTNOTE2)
- Repeating the same (Y, Z) tuple in several rows does not contribute any useful information to the system. It is redundant.
In short, since Y is not a key and Y->Z, we have violated the 3NF.
Redundancies lead to modification anomalies (e.g. updating some but not all of the Zs "connected" to the same Y essentially corrupts the data, since you no longer know which copy is correct). This is typically resolved by splitting the original table into two tables, one containing {X, Y} and the other other containing {Y, Z}, This way, Y can be a key in the second table and Z is not repeated.
On the other hand, if the X<-Y does hold (i.e. X->Y->Z is not transitive), then we can retain a single table, where both X and Y are keys. Z won't be unnecessarily repeated in this scenario.
(FOOTNOTE1) A key is a (minimal) set of attributes that functionally determine all of the attributes in a relation. Rationale: If K is a key, there cannot be multiple rows with the same value of K, so any given value of K is always associated to precisely one value of every other attribute (assuming 1NF). By definition (see FOOTNOTE2), "being associated to precisely one" is the same thing as "being in a functional dependency".
(FOOTNOTE2) By definition, Y->Z if, and only if, each Y value is associated with precisely one Z value.
Example:
Assuming each message has exactly one author and each author has exactly one primary e-mail, attempting to represent messages and users in the same table would lead to repeating e-mails:
MESSAGE USER EMAIL
------- ---- -----
Hello. Jon jon@gmail.com
Hi, how are you? Rob rob@gmail.com
Doing fine, thanks for asking. Jon jon@gmail.com
(In reality, these would be MESSAGE_ID
s, but let us keep things simple here.)
Now, what happens if Jon decides to change his e-mail to, say, "jon2@gmail.com"? We would need to update both of Jon's rows. If we only update one, then we have the following situation...
MESSAGE USER EMAIL
------- ---- -----
Hello. Jon jon2@gmail.com
Hi, how are you? Rob rob@gmail.com
Doing fine, thanks for asking. Jon jon@gmail.com
...and we no longer know which one of the Jon's e-mails is correct. We have essentially lost the data!
The situation is especially bad since there is no declarative constraint we could use to coerce the DBMS into enforcing both updates for us. The client code will have bugs and is probably written without much regard for complex interactions that can happen in the concurrent environment.
However, if you split the table...
MESSAGE USER
------- ----
Hello. Jon
Hi, how are you? Rob
Doing fine, thanks for asking. Jon
USER EMAIL
---- -----
Jon jon@gmail.com
Rob rob@gmail.com
...there is now only one row that knows about Jon's e-mail, so ambiguity is impossible.
BTW, all this can be viewed as just another expression of the DRY principle.