Understanding Database Normalization - Second Normal Form(2NF)

Question

I have been learning Normalization from "Fundamentals of Database Systems by Elmasri and Navathe (6th edition)" and I am having trouble understanding the following part about 2NF.

The following image is an example given under 2NF in the textbook

The candidate key is {SSN,Pnumber} The dependencies are SSN,Pnumber -> hours, SSN -> ename, pnumber->pname, pnumber -> plocation

The formal Definition:

A relation schema R is in 2NF if every nonprime attribute A in R is
fully functionally dependent on the primary key of R.

for example in the above picture:

if suppose, I define an additional functional dependency SSN -> hours, then taking the two functional dependencies,

{SSN,Pnumber} -> hours and SSN -> hours

the relation wont be in 2NF, because now SSN ->hours is now a partial functional dependency as SSN is a proper subset for the given candidate key {SSN,Pnumber}.

Looking at the relation and its general definition on 2NF, i presume that the above relation is in 2NF

As far as my understanding goes and how i understand what 2NF is,

A relation is in 2NF if one cannot find a proper subset (prime attributes)
of the on the left hand side (candidate key) of a functional dependency 
which defines the NPA(non prime attribute).

My first question is, Why is the above relation not in 2NF? (The textbook has considered the above relation as not in 2NF)

There is, however, a informal ways(steps as per the textbook where a normal person not knowing normalization can take to reduce redundancy) being defined at the beginning of this chapter which are:

■ Making sure that the semantics of the attributes is clear in the schema
■ Reducing the redundant information in tuples
■ Reducing the NULL values in tuples
■ Disallowing the possibility of generating spurious tuples

The guideline mentioned is as follows:

My second question is, If the above steps described are taken into account, and consider why the following relation is not in 2NF, do you assume the following functional dependencies, which are,

{SSN,Pnumber} -> Pname
{SSN,Pnumber} -> Plocation
{SSN,Pnumber} -> Ename

making the decomposition of the relation correct? If the functional dependencies assumed are incorrect, then what are the factors leading for the relation to not satisfy 2NF condition?

When looked at a general point of view ... because the table contains more than one primary attributes and the information stored is concerned with both employee and project information, one can point out that those need to be separated, as Pnumber is a primary attribute of the composite key, the redundancy can somehow be intuitively guessed. This is because the semantics of the attributes are known to us.

what if the attributes were replaced with A,B,C,D,E,F

My Third question is, Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?

Because based on the data and relation state at a given point the functional dependencies can change which was valid in one state can go invalid at a certain state.In general this can be said for any non primary attribute determining non primary attribute.

The formal definition :

A functional dependency, denoted by X → Y, between two sets of
attributes X and Y that are subsets of R specifies a constraint on the 
possible tuples that can form a relation state r of R. The constraint is     
that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must 
also have t1[Y] = t2[Y].

So won't predefining a functional dependency be wrong as on cannot generalize relation state at any given point?

Pardon me if my basic understanding of things is flawed to begin with.

Re "if suppose, `{SSN,Pnumber} -> hours and SSN -> hours` this relation wont be in 2NF, because now SSN ->hours is a partial functional dependencies" This reasoning is not correct unless you add that it is a partial dependency of a non-prime attribute on a candidate key. But the FDs in isolation do not imply 2NF. — philipxy, Jun 29 '16 at 03:13
@philipxy I would be glad if you could edit the question if its at a point of not being understandable to others.... I will make sure if the edits you make conveys my question at hand. Because i been beating my head to find the answer and clarify my doubt. — xAditya3393, Jun 29 '16 at 03:45
But since SSN is a proper subset to the candidate key, assuming {SSN, Pnumber} is a candidate key for the above relation, wont there a case where Pnumber may be duplicate entries on the process when {SSN, Pnumber} functionally determine hours ? — xAditya3393, Jun 29 '16 at 03:46
Could you tell me which parts you didn't understand/were vague so that i could edit it myself — xAditya3393, Jun 29 '16 at 03:59
I can't understand anything after "Why is the above relation not in 2NF?". Use proper terms, shorter sentences, more sentences and make sure that every time you refer to something that your text is clear about what thing and what properties of it you are talking about. Start with your first question. And in my comment above about "if suppose", you don't give all your reasoning for your "because" so I don't know whether you are confused or not. (I still don't know.) — philipxy, Jun 29 '16 at 04:36
Re your "But" comment: I don't understand "Pnumber may be duplicate entries on the process". Can you make that more clear? Again: Look at what is in front of you: predicates & rules/restrictions on situations, giving restrictions on databse/table values, plus definitions/procedures. Use official terms to talk about things and their parts. You haven't even said what schema you are talking about. Or what claim your "but" applies to. Can there be multiple rows with the same Pnumber? Yes. But per the CK they will have different Ssns and per FD3 they will have the same Pname & Plocation. — philipxy, Jun 29 '16 at 05:13
1. The textbook definition is wrong. 2. Adding SSN -> hours means the CK might change so the "given" CK is irrelevant. 3. Your reasoning re "SSN -> hours" as partial FD already applies to {Ssn} & {Pnumber) as determinants of partial FDs in the original schema. 4. Your paraphrase of 3NF is garbled. Eg the LHS of a FD is a determinant. A CK is a column set that functionally determines every attribute but doesn't contain a proper subset that does. 5. The informal steps are useless for normalizing; they just give some informal vague motivation/effect. (If "the above steps" are the bullets.) ... — philipxy, Jun 29 '16 at 08:27
... 6. Which relation is "the following" one? You mean "the following" schema? And do you mean "the one following" from appylying the bullets? And does that mean the given one? But those informal steps are *not* how the schema arose. It arose from a. noticing that if exactly the FDs implied by the given ones hold in every table value that can arise from using the DBA-given *predicate* to fill the table and b. using those FDs to find the given set of CKs. 7. Your 2nd question is still unintelligible. All these things are in my answer already. — philipxy, Jun 29 '16 at 08:41
1. Your edits are much clearer but there are still problems of clarity including what things you are referring to. 2. Consider asking a series of questions, as you get answers, for each sub-question or claim in your current question. For a claim, ask the question "Am I correct that ... ?". Because your "question" is really a bunch of questions and confusions. PS Attribute names are irrelevant. The designer gives a *predicate* and *business rules*; then they determine FDs from what table values can arise; they apply the formal normalization procedure using the FDs. — philipxy, Jun 29 '16 at 08:46
One final question, because a primary key can determine a non prime attribute .... would you only consider them as functional dependencies if stated under the list of FD's for a relation? — xAditya3393, Jun 29 '16 at 09:20
Your last comment's question is confused: 1. PKs don't matter. Forget about them. A PK is a CK you call PK. 2. CKs are not FDs. 3. Prime attribute = attribute of some CK. 4. When you are told some FDs hold, *certain other ones also hold*. 5. So: Find/receive some FDs that hold & don't hold. Then certain rules give all FDs that must also hold when the FDs that hold hold. Then after determining for every FD (set -> set) whether it holds you determine the CKs per the FDs that hold & don't hold. Then you have 2NF if and only if no non-prime attribute is partially dependent on a CK. — philipxy, Jun 29 '16 at 17:00
@philipxy Thanks for summarizing the whole process .... That one comment cleared lot of my problems :D — xAditya3393, Jul 03 '16 at 03:22
@philipxy If you notice the first line above and below the defintion of 2NF ... The thing that confused me was whether to consider the FD's mentioned in the first image or the one's you need to assume {wherein FD's on each primary key determining a Non primary key} is to be taken into account for 2NF test — xAditya3393, Jul 03 '16 at 03:42
I don't know what you're trying to say in the braces. (Maybe by "you need to assume" you mean "implied".) To show 2NF *all* FDs that hold must be taken into account. The texbook *could* be clear & say "in these example the only FDs that hold are the given ones, the ones that hold because of the given CK(s) and the ones that are implied by those (ie that follow from those, ie that must also hold when those do)." But to show that 2NF is *violated* you only need to find 1 bad FD. So here it's enough to know the CK & either FD2 or FD3. It doesn't matter what others hold. Why do you think it does? — philipxy, Jul 03 '16 at 05:18
I didn't realise you had to consider all those implied FD's too into account which clears everything up... as in if mt R{abc}, CK {ab} and one of my FD is b->c is partial because a->b which is implied is to be taken into account — xAditya3393, Jul 03 '16 at 07:28
Because Candidate key set wasn't specified as the example was given... I feel if only things were mentioned as "candidate key" over "primary key" on the line below the definition, things would been 100% clear — xAditya3393, Jul 03 '16 at 07:28
I was also under a misconception as to FD's being determined/explicitly stated (or given in a question) are the only one's taken into account, not realising that as a CK is defined, which may be composite as in {ab} , a->b and b->a FD's are implied — xAditya3393, Jul 03 '16 at 07:33
Re *"I didn't realise you had to consider all those implied FD's too"* I just said in the comment before you wrote this that to show it's *not* in 2NF you *only* need the CK and the partial dependency. You only need all FDs to show it *is* in 2NF. — philipxy, Jul 03 '16 at 08:37
Re *"R{abc}, CK {ab} & one of my FD is {b}->c"*: Your *"because"* is wrong. First, {a->b} & {b->a} are not implied. Second, they are irrelevant anyway. Given the quote, c is functionally dependent on a subset of {ab} (namely {b}), so c is partially functionally dependent on {ab}. Since c is non-prime attribute & {ab} is a CK there is a non-prime attribute partially functionaly dependent on a CK, ie the schema is not in 2NF. *Read a definition of 'partial FD' or 'full FD'.* (An FD that's not partial is full, and vice versa.) PS Another example of why you should ask one question per question. — philipxy, Jul 03 '16 at 08:44

philipxy · Accepted Answer · 2019-06-19T20:27:48.187

Why is the above relation not in 2NF?

Your original/first/informal "definition" of 2NF is garbled and not helpful. Even the quote from the textbook is wrong since 2NF is not defined in terms of "the PK (primary key)" but rather in terms of all the CKs (candidate keys). (Their definition makes sense if there is only one CK.)

A table is in 2NF when there are no partial dependencies of non-prime attributes on CKs. Ie when no determinant of a non-prime attribute is a proper/smaller subset of a CK. Ie when every non-prime attribute is fully functionally dependent on every CK.

Here the only CK is {Ssn, Pnumber}. But there are FDs (functional dependencies) out of {Ssn} and {Pnumber}, both of which are smaller subsets of the CK. So the original table is not in 2NF.

If the above statement is taken into account, do you assume the following functional dependencies

so won't the same process of the decomposition shown based on the informal way alone be difficult each time such a case arrives?

A table holds the rows that make some predicate (statement template parameterized by column names) into a true proposition (statement). Given the business rules, only certain business situations can arise. Then given the table predicates, which give table values from a business situation, only certain database values can arise. That leads to certain tables having certain FDs.

However, given some FDs that hold, we can formally use Armstrong's axioms to get all other FDs that must also hold. So we can use both informal and formal ways to find which FDs hold and don't hold.

There are also shorthand rules that follow from the axioms. Eg if a set of attributes has a different subrow value in each tuple then so does every superset of it. Eg if a FD holds then every superset of its determinant determines every subset of its determined set. Eg every superset of a superkey is a superkey & no proper subset of a CK is a CK. There are also algorithms.

Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?

When normalizing we are concerned with the FDs that hold no matter what the business situation is, ie what the database state is. Each table for each business can have its own particular FDs per the table predicate & the possible business situations.

PS Do "make sense" of formal things in terms of the real world when their definitions are in terms of the real world. Eg applying a predicate to all possible situations to get all possible table values. But once you have the necessary formal information, only use formal definitions and procedures. Eg determining that a FD holds for a table because it holds in every possible table value.

so would any general table be in 2NF based on a solo condition of a table having a composite primary key?

There are tables in 5NF (hence too all lower NFs) with all sorts of mixes of composite & non-composite CKs. PKs don't matter.

It is frequently wrongly said that having no composite CKs guarantees 2NF. A table without composite keys and where {} does not determine any attribute is in 2NF. But if {} determines an attribute then it's a proper/smaller subset of any/every CK with any attributes. {} determines an attribute when every row has to have the same value for that attribute.

I have edited the question with the formal definition and have shown an example on partial dependency — xAditya3393, Jun 29 '16 at 02:20
@WSimpson was pointing something similar to your argument, so could you clarify the question which i posted for him, Then am i correct in assuming if a table has a composite primary key, regardless of the functional dependencies is not in 2NF? — xAditya3393, Jun 29 '16 at 02:22
Because the whole point of normalization is to reduce redundancy based on functional dependencies, so would any general table be in 2NF based on a solo condition of a table having a composite primary key? — xAditya3393, Jun 29 '16 at 02:23
I have been taught upto BCNF and while reading the textbook i found this example which i was beating my heads on ... which lead to these questions :D ... i will look into 4NF and 5NF .. Thank you :) — xAditya3393, Jun 29 '16 at 02:34
Sometimes to learn the abstract you need multiple examples of real world applications. The whole point is to learn how to accurately define a model that can be applied to multiple real world scenarios. If the person trying to understand the model doesn't have a full understanding or has a misunderstanding of the terms being used. Then a real world or informal explanation is required to ensure that they can come to understand the correct meanings for the formal terms. — WSimpson, Jun 29 '16 at 02:41
@WSimpson The formal definitions apply *given* the FDs and attributes. That is what should be applied to normalize. (Formal examples help to teach that.) The real world scenario determines the FDs informally. Learning to do that is helped by examples involving the real world. It's helpful to learn the connection between the formal normalization notions & process vs the informal. But informal notions do not "convey" exactly what to do unless they are mapped appropriately to formal ones. — philipxy, Jun 29 '16 at 02:51

James K. Lowden · Answer 2 · 2016-06-29T02:35:31.590

Why is the above relation in 2NF?

EP1, EP2, and EP3 are in 2NF because, for each one, the key identifies the non-key. No part of any key identifies any part of any non-key. That is what is meant by for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].

By contrast, you might say EMP_PROJ is over-specified. If ssn identifies, ename (as the text says it does), then the combination of {ssn, pnumber} is too much. There exists a subset of the key {ssn,pnumber} that identifies a part of the non-key, {ename}. That situation does not occur in a table conforming to 2NF, as EP1, EP2, and EP3 illustrate.

Are functional dependencies ... based on ... domain knowledge of the attributes?

Emphatically, yes! That's all they're based on. The DBMS is just a logic machine. The ideas of "employee" and "hours" don't exist for it. The database designer chooses to define tables that model some real-world universe of discourse, and imposes meaning on the columns. He gives names to the attributes (above) in X and Y. He decides which columns serve to identify a row based on what is true about the universe being modeled.

if a table has a composite primary key, regardless of the functional dependencies is not in 2NF?

No. Remember, 2NF is defined in terms of FDs. What could it mean to speak of conforming to 2NF "regardless" of them?

The number of columns in the key is immaterial. It's some set, X, identifying the complement, Y.

Sorry i missed the "not" in the first question while typing and i was trying to challenge people if they said any table with composite primary key is not in 2NF :D .. sorry about that .... I have added a formal definition and stated a partial functional dependency in my example if it helps clear my question. — xAditya3393, Jun 29 '16 at 02:25
I expect that you understand the topic but what you wrote has some problems. Eg "EP1, EP2, and EP3 are in 2NF because, for each one, the key identifies the non-key." No, because a CK *always* identifies non-CK (and CK) attributes. Eg "No part of any key identifies any part of any non-key. That is what is meant by for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y]." No, the 1st sentence is about 2NF but the 2nd sentence is about FDs. — philipxy, Jun 29 '16 at 02:56

score 1 · Answer 3 · answered Jun 29 '16 at 02:52

I'm not sure if I thoroughly understand your questions, but I'll give a try to explain.

Your first statement about 2NF:

a relation is in 2NF if one cannot find a proper subset on the left hand side of a functional dependency which defines the NPA

is correct, as well as your supposition

if {SSN,Pnumber} -> hours and SSN -> hours then this relation wont be in 2NF

because what that means that you could determine 'hours' from 'SSN' alone, so using the composite key {SSN,Pnumber} to determine 'hours' will be redundant, and thus violates the 2NF requirements.

What you call the left hand side of an FD is usually called a key. You use the key to find the related data. In order to save space (and reduce complexity), you should always try to find a minimal key, and break up larger tables into smaller ones if possible, so you do not have to save information in more places than necessary. This is what normalization to the normal forms is all about, and being studied for about half a century now, substantial theory on the matter has been developed, and some rules chrystalized from it, like 1NF, 2NF, 3NF etc.

Your second question confuses me a lot, because from what you are saying, it seems you already understands this. Could there be some confusion about the FD's? From the figure, it seems to me as they are defined like this:

{SSN,Pnumber} -> hours
{SSN} -> ename
{Pnumber} -> Pname,Plocation

Just like the three lower tables are modeled, together they add up to the relation (table) modeled above. So, in the first table, you would need the composite key {SSN,Pnumber} to access any data in the relation (search in the table), while that clearly is not necessary for most of the fields.

Now, I'm not sure about what purpose that table would fulfill in real life. While that is not formally necessary, as long as the FD's are given, it might be easier to imagine why the design will benefit from normalization.

So let's day it's about recording workhours per emplyee per project in some organization. SSN identifies the employee, (whose name also is stored as ename because it is easier to remember, but could be duplicate), Pnumber identifies the project, which name and location is also stored much for the same reason.

Then if you as a manager need to register that an employee worked another few hours on some project, you would use your manager app on your device, which in turn will update the tables seamlessly (you cannot expect managers to understand the logics of normalization)

Behind the scenes, however, it would amount to some query, in SQL that would be an 'INSERT' statement which added another row to the relevant table(s).

Now you can see that in the above table, you would have to insert all the six attributes, while with the normalized tables below, you will only need to add a row to table EP1,consisting of three attributes. In a large organization with thousands of employees delivering their worksheets every week, that will quickly become huge differences in storage requirements. That has a number of benefits, perhaps the most significant beeing search speed.

Your third question I don't understand at all, I'm afraid. In a way you could say FD's are predetermined once you have decided what data you will save in your database. The FD's are not dupposed to change. When modeled in the DB, they will not change. If you later find you will alter the design, then that will be new relations with new FD's.

The text you seem to be quoting from somewhere simply says that if you have the FD X -> Y (X gives or determines Y) then if you have any two tuples (records) in that relation (table) that have the same value of X, they must also hve the same value of Y. Or in our example, if Pnumber somewhere is given the value of 888, Pname is 'Battleship' and Plocation is 'Kitchen Sink', then if somewhere else (some other record) the Pnumber 888 is used then also there Pname must be 'Battleship' and Plocation must be 'Kitchen Sink' because Pname and Plocation is functionally dependant on Pnumber.

Now that was almost another chapter in your textbook, or what? Hope it helps, because it took me some time to write :-)

The LHS of a FD is a "determinant". Only some FDs have determinants that are CKs. — philipxy, Jun 29 '16 at 03:00
thank you for taking the trouble of understanding the questions .. i have made the edits .. could you check the question once more? — xAditya3393, Jun 29 '16 at 07:13
Hi, it seems you've got a lot of qualified help here already. If you would like me to answer a specific question, please enter it in another comment. Good luck with your course! — Marius, Jun 29 '16 at 12:52

WSimpson · Answer 4 · 2016-06-29T02:36:23.810

0

A table can be said to be in 2NF, if the primary key is composed of multiple columns, and that if for each row these columns were concatenated together into a single string, then the resulting column would qualify as the primary key. Alternatively a single column primary key will also qualify as 2NF.

In this case the same employee could have multiple phone numbers (PNUMBER), so a you cannot have a compound primary key that includes the phone number.

edited Jun 29 '16 at 02:36

answered Jun 29 '16 at 01:25

WSimpson

460
4
7

Then am i correct in assuming if a table has a composite primary key, regardless of the functional dependencies is not in 2NF? – xAditya3393 Jun 29 '16 at 01:43
Just to note, I wouldn't divide the tables up the way they did. From what you have above; ssn, pnumber, and hours are combined into one table while ssn and ename are in another. What should probably happen is that there should be a shifts table, that has what the available values for hours are. Then the SSN should be mapped to ename and hours in one table. The ssn and pnumber in another and the last table could be pnumber, pname and plocation. Although without more context, I am uncertain as to the purpose of this last table. – WSimpson Jun 29 '16 at 02:17
@ WSimpson & @PagadalaVikramaditya `. Normalization does not add columns. 2. Ssn should *not* "be mapped to ename and hours in one table". Since {Ssn}, a proper subset of CK {Ssn, Pnumber}, determines ename, that table wouldn't be in 2NF. Apply the formal definitions yourself to see this. – philipxy Jun 29 '16 at 03:09
Actually the way I read this, the hours is defining the shift, which means that that it is really the start and end values for the shift, which means the column hours is actually a table of two columns. Thus the ssn, pnumber, hours table is not 1NF. – WSimpson Jun 29 '16 at 03:37
@ WSimpson (& @PagadalaVikramaditya) You have been given attributes and FDs. At that point the process is formal and it doesn't matter what the table means. What you propose contradicts what you have been given. That design is fine. Eg your "which means..." re hours is wrong. It could be total hours, start time or finish time and be functionally dependent on that CK. You don't seem to understand what normalization is. Re "1NF" and "atomic" (about which *much* nonsense is written) see [this](http://stackoverflow.com/a/24038895/3404097). – philipxy Jun 29 '16 at 04:51

Understanding Database Normalization - Second Normal Form(2NF)

4 Answers4