2

I'm a beginner with SQL.

In the first normal form (1NF) of SQL tables, one shouldn't have repetition of values and columns should have atomic values.

I have a table where I have columns StartDate | EndDate | Duration.

StartDate has value 01/01/2000 and EndDate 03/03/2003, so the duration is 2 dates, 01/01/2000 - 03/03/2003, using the same two values that are in the StartDate and EndDate fields.

Do I get rid of the Duration column completely and assume that one is able to find the info with queries, or is there a way to normalize this to follow the rules without losing any data?

I noticed this problem in the 3NF point when I separated these 3 columns to another table from the other information, but I reckon this should be done way before that?

philipxy
  • 14,867
  • 6
  • 39
  • 83
Joonas
  • 23
  • 6
  • does you have duplicates in "Duration" field? – Evgeny Apr 18 '17 at 08:56
  • You are right, you are storing redundant information. You should get rid of either EndDate or Duration (or even StartDate, but that would look a bit silly ;-) All DBMS have some date functions to get the duration from two dates or the other date from a date and a duration. – Thorsten Kettner Apr 18 '17 at 09:03
  • Not duplicates per se, but the same two values that are in StartDate and EndDate fields. I'll clarify the question a bit! – Joonas Apr 18 '17 at 09:03
  • @Joonas, you could have a clear ultimate goal. Which column do you want to make unique. Can you say what problem stops at the moment and give an example? – Evgeny Apr 18 '17 at 09:10
  • I would get rid of the duration column, or - depending on the DBMS you are using - make it a computed column. Calculating that value is typically quite cheap, so there is no real point in storing it. –  Apr 18 '17 at 09:11
  • @ЕвгенийКондратенко I'm just trying to understand the normalization process, I'm working with small tables in excel and normalizing them so the question isn't really what works in a database and what doesn't. It's about what is the correct way to do it. I'll remove the field, it seems like the best answer. – Joonas Apr 18 '17 at 09:20
  • 1
    What does "repetition of values" mean? What does "the 3NF point" mean? And, "way before that" in what process? Please explain yourself instead of using shorthands that fit but do not explain what you are thinking about. – philipxy Apr 18 '17 at 09:46
  • 1
    Is Duration a (start, end) pair? Or is a *distance* between two timeline points, ie it can be calculated as a difference but does not determine endpoints? – philipxy Apr 18 '17 at 10:26
  • PS When you say it's "two dates" (suggesting the former), maybe you mean it's "two days (long)" (suggesting the latter)? – philipxy Apr 18 '17 at 10:35
  • @philipxy sorry for a bit un-clear notation, it's a dash, not a calculation. That's where the problem arises because I have that same data in the other fields and the value in duration field is equal to a query **Select StartDate ||","|| EndDate** And related to 3NF and before that; I'm doing the normalization one form at the time (1NF, 2NF, 3NF, BCNF), so I have all the interphases of the tables in excel from unnormalized to BCNF form and I noticed the problem near the end. – Joonas Apr 18 '17 at 16:19
  • See my edited answer PS 2 re how you should *not* normalize by moving through normal forms to the highest wanted. Also, please edit your question to clarify re "duration" (perhaps as suggested in my comments) and "normalization one form at [a] time"; comments are not for clarification & are ephemeral. – philipxy Apr 19 '17 at 04:54

3 Answers3

0

If StartDate and EndDate are just the dates from Duration, then it functionally determines them, and the two of them together functionally determine it. So normalization (preserving FDs (functional dependencies)) to a high enough normal form will make you separate the three of them into a separate table.

But clearly it is enough to have either {Duration} or {StartDate, EndDate} as a subset of your columns. If you add another of the attributes, it's redundant. Normalization can't help with that, because it only removes redundancies where a table can be replaced by projections of it that join to it.

Generally speaking it is better to store interval end points. The general idea in relational design is to have a column for any part of some thing that you might want to query about separately. Learn about computed/calculated columns if you want to control this kind of redundancy.

PS 1 See this answer re 'atomicity'.

PS 2. Normalization is not done by moving through normal forms to the highest wanted. (Which should be 5NF, then for certain reasons one can denormalize. Normalizing to lower normal forms can exclude good higher-level designs from arising. Find the algorithms for normalizing to 3NF or BCNF.

Community
  • 1
  • 1
philipxy
  • 14,867
  • 6
  • 39
  • 83
  • 1
    Having **only** the Duration is something different then having only StartDate and EndDate. With only the Duration you lose information. –  Apr 18 '17 at 09:59
  • If Duration were a duration yes, but the op says that Duration is a start-end pair. Or maybe not. I guess you mean, that's a minus sign not a dash. – philipxy Apr 18 '17 at 10:23
  • The duration is defined as the difference between enddate and startdate according to the question. –  Apr 18 '17 at 10:26
  • As I just said, you've seen "-" as a minus sign, but per the op's edit "so the duration is 2 dates, 01/01/2000 - 03/03/2003" and comment "Not duplicates per se, but the same two values that are in StartDate and EndDate fields" (which I edited into the question) it seems it's a dash. – philipxy Apr 18 '17 at 10:29
  • Hmm, I read that as "the duration is 2 days" (assuming a mixup between "date" and "day"), but reading that again your interpretation might be correct as well. –  Apr 18 '17 at 10:32
  • I added an explicit comment on the question. – philipxy Apr 18 '17 at 10:33
  • But then, storing two dates in a single column would would already violate 1NF –  Apr 18 '17 at 10:34
  • "1NF" as defined [sic] in terms of the fuzzy unclear notion of "atomicity", which is in the eye of the beholder, is irrelevant to the problems normalization to higher normal forms is intended to solve. (See the link in my answer.) – philipxy Apr 18 '17 at 10:44
  • Answer accepted, I agree that as you said, it's enough to have start and end dates or duration, but for the sake of queries, using 2 fields for 2 values is better so I get rid of the duration field. It may have been kind of an unintentional conundrum in the materials I'm reading. – Joonas Apr 18 '17 at 16:23
0

I don't familiar with business rules and requirements in your domain, but for example, if we know that duration always the same - we can delete EndDate field, or in the case of boxing rounds (which is always 3 minutes) - we can delete EndDate and Duration! But if your project is high loaded and performance is drastically important for you - you might consider an option of micro-optimization and have all 3 fields, just with the purpose to win some microtime.

My point here - is no one able to say that have all 3 fields (StartDate, EndDate or Duration) in table it's bad approach.

cn007b
  • 16,596
  • 7
  • 59
  • 74
  • It's true that normalization never drops columns from a design. But having the three of them along with other columns in the same table can (and typically does) violate various normal forms, so that decomposition/normalization is needed. – philipxy Apr 18 '17 at 10:04
  • 2
    The question **is** about normalization. In this case storing the `duration` (as the difference between enddate and startdate) violates 2NF which says: "All non-key attributes are dependent on the key, the whole key and nothing but the key*". And the duration column would not **only** depend on the primary key, but also on the startdate and enddate column. –  Apr 18 '17 at 10:30
  • @a_horse_with_no_name Yes, you are right, my bad. But in case we definitely need all those 3 fields - I wouldn't have table `durations` even if it violates any normal form... – cn007b Apr 18 '17 at 11:08
0

I must admit that I never cared much about the normalization forms, because in order to avoid redundancies and inconsistencies in a database you apply them kind of automatically without thinking 1NF, 2NF, etc.

With your current table design (3 columns: StartDate, EndDate, Duration, and maybe others) 1NF is met, because you can neither split StartDate nor EndDate nor Duration into meaningful parts.

2NF is about the primary key if I am not mistaken. And 3NF is the one that we violate with the given table design, I guess. It's because if we had two records with the same StartDate and EndDate they'd have the same Duration, so this field doesn't depend on some primary key alone. The same can be said of StartDate and EndDate. Three columns hence, where we can remove one to fulfill the 3NF requirement. You are free to choose.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73