1

There seems to be a misconception on what is a 'repeating group', in terms of its removal in 1st Normal Form (1NF), could any one clearly clarify what it actually is and how to identify one?

The misconception, I found, is discussed further here: https://stackoverflow.com/a/23202535/4011506. However, that has further confused me.

In regards to the following video (4:30 onwards), https://www.youtube.com/watch?v=HHDH6N_qjm4,

Aren't the Name, Address, Town, Postcode and Gender fields the repeating groups?

Rather than what the author of this video suggests (last 5 fields)?

Further, if any one is able to provide examples demonstrating the misconception of repeating groups, and what the actual answer is, that would be great as well.

Thank you

EDIT: The responses I have gotten has perplexed my confusion further... so I have another example to assist with the query:

https://hostr.co/file/970/ITiHuyVyCmPr/UNF.png

Given that unnormalised table, my text suggests that to 'remove' a repeating group, you remove the NULLs (by adding appropriate data values into the NULL fields). Resulting in this: (please replace the above link after '970' with: /6QWYWR8FtxFD/1NF.png

Thus, essentially, what is the repeating group in this case? (From what I have just read from the replies, it is NOT the fields that contain the SAME data for each instance/row, even though literally, they are essentially 'repeating', is this correct? Sort of unintuitive...)

And by the suggestion of the 'removal' of a repeating group, does one literally remove the repeating group (as delineated in the video in the original post) or does one 'remove' the NULLs (by adding values into it)? Or is it contingent on the situation...?

Community
  • 1
  • 1
Restricted
  • 63
  • 1
  • 10
  • http://stackoverflow.com/questions/23194292/normalization-what-does-repeating-groups-mean/23202535#23202535 – Damir Sudarevic Oct 14 '14 at 11:44
  • 1
    *"Aren't the Name, Address, Town, Postcode and Gender fields the repeating groups?"* No. Repeating 'M' in every row for males and repeating 'F' in every row for females doesn't constitute a repeating group. The idea of a repeating group applies to single rows, as does the idea of partial key dependencies (2NF), the idea of transitive dependencies (3NF), etc. – Mike Sherrill 'Cat Recall' Oct 14 '14 at 11:55
  • Thank you for the response Mike. O-K, if I am understanding your assertion correctly, even though a student may register for multiple courses (and thus, their name, address, town, postcode and gender) will be 'repeated' (in that sense), it does not constitute a repeating group? – Restricted Oct 14 '14 at 12:57
  • 1
    If you assume https://hostr.co/file/970/ITiHuyVyCmPr/UNF.png shows 4 tuples then the first tuple (PROJ_NAME Evergreen) has a repeating group for EMP_NUM consisting of 5 integers (101,102,103,105,106), and similarly for the other columns. If the repeating groups are removed, as in https://hostr.co/file/970/6QWYWR8FtxFD/1NF.png then there are 21 tuples each with a single integer value for EMP_NUM. – DrabJay Oct 14 '14 at 14:32
  • Thank you. Thus, the 'removal' of the repeating groups as you noted in hostr.co/file/970/6QWYWR8FtxFD/1NF.png, is essentially not removed? That is, some of the EMP_NUM values themselves you noted as a repeating group, 101, 102, 103, 105, 106, is not literally removed from the table (or transferred to another table) but rather it no longer is characterised as a repeating group? Furthermore, is the 1NF image, actually in 1NF yet? Isn't it good practice to "move" the repeating groups' fields by creating a separate table? May I ask if you could answer that as well please? Thank you for your help. – Restricted Oct 15 '14 at 00:31
  • Resolving 1NF violations is taught in two ways. One way involves flattening the table without decomposing it. That results in the outcome you showed in your example. This resolves the 1NF violation, but it creates a 2NF violation at the same time. Presumably, a later step in the normalization process resolves the 2NF violation by decomposing the tables. – Walter Mitty Oct 15 '14 at 15:42
  • The other way that's taught is to decompose the table into two tables one of which has the key of the original table but not any part of the repeating group. The other table has a composite key, being the key of the original table plus a key to the repeating group. This resolves the 1NF violation without creating any additional 2NF violations. I learned it this way, but I kinda like the other way of teaching it better. – Walter Mitty Oct 15 '14 at 15:44
  • Thank you Walter. That was greatly insightful. One thing I would like to ask though is the 2nd way to resolve the 1NF violation, in that does it really matter if the other table has the key of the original table? In other words, does the table (with the composite key) have to have the key from the original table? This is demonstrated in the video linked in the original post, with the student_no in the "other table" with the course and teacher information. That is, can't this table have existed without the student_no? I have seen this way of resolving the 1NF violation, but is it correct? Thank – Restricted Oct 16 '14 at 06:33

3 Answers3

2

As far as I'm concerned, the misconceptions go back at least as far as when I learned relational databases, in the early 1980s. My easiest conceptual handle in 1NF is that the value that lies at the intersection of a row and a column must be a simple value. However, the definition of "simple" is not all that simple (little joke).

A character string is a simple value, even though it is made up of components, namely characters. A timestamp is simple even though it is made up of a date and a time, each of which are made up of things like year, month, etc.

I treat a list of values separated by commas as a violation of 1NF, although this may simply mean that I use the informal definition.

In terms of relations and the relational data model, there is nothing in relational math to forbid the intersection of a tuple and an attribute from itself being a relation. And it was this construct that I think E.F. Codd was intending to exclude when he came up with normalization. (Later named first normal form, once the second form was discovered.)

Why did he use the term "normalization". Well, here is what he was driving at, IMO. For any defined collection of information requirements, there is a set of relational models that all express those requirements adequately. The members of that set can be regarded as "equivalent" in this sense. One can come up with a rule for picking one member of that set and calling it the "normal" representation of the entire set. The rule Codd came up with was what became the 1NF rule: no subrelations.

Another place where the concept of "normalized" had come up earlier in computing was floating point numbers. The numbers 0.1 and 1e-1 both represent the same number, and there is an uncountable number of ways of representing that same number. One of them can be called the "normalized" representation. As soon as (binary) floating point numbers began to be supported in computer instruction sets, there was a way to "normalize" a binary floating point number in such a way that the mantissa was always between one-half and one, unless the number was zero. It's not important to understand these details. My point is just that the term "normalized" was already part of computer jargon in 1970, when Codd wrote his paper.

A larger question, IMO, is "what's so bad about violating 1NF? Why is 1NF important?"

The answer to that question has two parts: one has to do with the logical model, and one has to do with the physical implementation of the first relational DBMS, which didn't come until about 1978.

In the logical model, it's desirable to say that specifying a tuple (by providing a key value) and specifying an attribute (by name) ought to be sufficient information to pin down the value being specified to a single value, at the level of detail the DBMS deals with. The early literature called this "keyed access to all data".

In the physical model, it's important to avoid having to do an entire table scan in order to do a simple search for all occurrences of a simple value. An index ought to do the trick in a few disk reads instead of maybe millions of disk reads for a table scan. To my knowledge, no DBMS ever built can create an index on the individual values stashed in a CSV text stored in a column. And Codd surely wanted to avoid the need for such an index when the first relational DBMS was to be built.

It's this last point that brings me back to the "theory is practical" motto that has been so helpful to me. Neophyte database designers come in here all the time, asking why it wouldn't be "more efficient" to store a list of course codes in the student table, instead of resorting to the "complexity" of creating a junction table with two foreign keys, studentId and courseId. The answer that convinces the neophyte usually has to do with doing table scans, and the associated delay, for operations that a good DBMS will do via an index, provided that the appropriate index has been created. IMO, that's what I think Codd was driving at, too.

Just to further muddy the waters, it's possible (at least in Oracle) to define a table that contains sub-tables in one of the columns. The question might arise whether such a database is or is not in 1NF. My answer is no.

(sorry this is so long.)

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58
  • If I understand him correctly, Chris Date's opinion is that the values that lie at the intersection of a row and a column can be arbitrarily complex (like documents or movies), but the dbms either a) treats them as any other value that has no internal structure, or b) provides functions to manipulate the parts. The dbms provides functions to manipulate the parts of a timestamp, for example. – Mike Sherrill 'Cat Recall' Oct 14 '14 at 11:39
  • I think you are right. However, part b) can be interpreted two ways. On the one hand, the DBMS might provide a built in function, such as month(D), that can extract the month from timestamp D. So far, so good. Several DBMS products do this kind of thing. The second level of manipulation would permit a query spec like "where month(DateColumn) = month(3)" to be resolved by some kind of index. There may well be some DBMS that provides this kind of support for the substructure of dates. Index support for CSV structures is a whole different ball of wax. – Walter Mitty Oct 14 '14 at 18:03
  • [PostgreSQL](http://www.postgresql.org/docs/current/static/indexes-expressional.html), [Oracle](http://docs.oracle.com/database/121/SQLRF/statements_5013.htm#SQLRF01209), and DB2 support "function-based indexes" or "expression-based indexes", which allow things like `create index on your_table (extract(month from DataColumn))`. I believe they usually require you to use the same expression in the WHERE clause. An expression like `where extract(month from DateColumn) = 3` would could use the index, but `where DateColumn between '2014-03-01' and '2014-03-31'` would not. – Mike Sherrill 'Cat Recall' Oct 14 '14 at 18:43
1

A repeating group is a multi-valued construction in a database table: an attribute that has no single value but consists of multiple values that are usually accessed by positional index rather than by name. Codd's original First Normal Form eschewed tables with repeating groups in favour of relations consisting of single-valued attributes only. Most modern DBMSs don't permit multi-valued attributes so this prohibition is not especially significant.

Codd later revised/rephrased his definition of 1NF, stating that it excluded relation-valued attributes (note that he did not ever call relation-valued attributes "repeating groups" - relations are a different thing altogether). The ban on relation-valued attributes (RVAs) is a bit more controversial than the original definition of 1NF. A relation in principle is a value and can be accessed just like any other single value. For "obvious" reasons of clarity and simplicity RVAs can usually avoided. But it doesn't necessarily follow that they must always be avoided. Since RVAs are single values (unlike repeating groups) they can be subject to relational operations just like other other single values. To say that a relation ought to consist of any kind of single-valued attribute except for a relation seems unnecessarily restrictive and a bit arbitrary.

Take the example of a simple (undirected, loop-free, not multiple-edged) graph. Assume a graph is represented in a relvar with a tuple for each edge, with lt, rt attributes to identify the nodes:

G {lt,rt}

The graph is represented by the relation value of G. Now suppose we need to model an arbitrary collection of such graphs in a relational database. Clearly one graph must be distinguishable from another because it consists of a different set of nodes and edges but different graphs might also have some nodes and edges in common. Isn't the most natural representation (perhaps the only possible representation) a new relation consisting of graph values - i.e. relation values of the same type as G?

Graphs { G {lt,rt} }

This Graphs relvar has one attribute, G, which is a relation-valued attribute with heading {lt,rt}. If we wish to avoid the RVA at all costs then a possible alternative would be to extend our original G relvar by adding a new attribute to identify the graph by something other than its nodes and edges.

Gnam {graphname,lt,rt}

That may well be a useful solution but why must it be the solution under 1NF? It might be highly inconvenient to invent a new attribute just for the sake of database representation if such an attribute doesn't exist in the reality we are attempting to model. The normalization procedure does not normally require new attributes to be invented to satisfy any normal form. If RVAs are understood to be excluded by 1NF, it appears that normalization as a design procedure may be "broken" - our graph representation really cannot be normalized; only the alternative Gnam form of it can be successfully normalized. I agree that RVAs are probably undesirable in most circumstances. I'm less convinced that it's a good idea to interpret 1NF as excluding them altogether.

Just to reiterate however, RVAs are not the same thing as repeating groups.

nvogel
  • 24,981
  • 1
  • 44
  • 82
0

A relation is a set of n-tuples where the jth attribute of each n-tuple is taken from the jth domain of the relation.

First normal form is a matter of definition i.e. ensuring the database consists only of relations. So a database is in first normal form if and only if it consists only of relations

If the database is in first normal for it means that every attribute of every relation is defined over a domain. The domain specifies the set of allowed values that may assigned to the attribute and the operations that may be performed on those values.

e.g. you could define a domain StringUpper(10) which consists of between 0 and 10 characters between the values A and Z. Among the operations that may be performed on a value in this domain is SubString(Start, End) which extracts the characters beginning at Start and ending at End.

The domain may be "arbitrarily complex" such as MPEG-4 or PortableNetworkGraphic; it just needs to be ensured that any value assigned to an attribute defined over such a domain is indeed in the set of allowed values for that domain e.g. is a valid MPEG-4. The values in these domains may have specific operations that may be performed on them e.g. Play or View.

The domain may contain what some may consider repeating groups e.g. you could define a domain ListStringUpper(10) which consists of 0 or more elements where each element consists of between 0 and 10 characters between the values A and Z. Among the operations that may be performed on a value in this domain is GetElement(Index) which extracts the Indexth element in the list. (Note this is not too different from a string which is a list of characters)

However, a potential issue arises if one of the domains has itself relations as allowed values. If this is the case then instead of being able to use first-order logic to operate on the relations in the database, second-order logic must instead be used. This can make the operations more complex. In order to remove this complexity, and ensure that first-order logic can always be used, the definition of first normal form could be refined such that:

A database is in first normal form if and only if it consist only of relations and there is no domain of an attribute of a relation that itself has a relation as an allowed value.

DrabJay
  • 2,989
  • 2
  • 13
  • 12
  • My apologies however there seems to be quite sophisticated responses, which although is very insightful, have led to further confusion that is beyond me. Is there a possibility that you could rephrase that into Layman's terms in response to my original query, if not, the new example I have just added into the original post? – Restricted Oct 14 '14 at 13:11