
According to my understanding - and correct me if I'm wrong - "normalization" is the process of removing redundant data from the database design.

However, when I was trying to learn about optimizing/tuning databases for performance, I found that Mr. Rick James recommends against normalizing continuous values such as INTs, FLOATs, DATETIMEs, etc.:

"Normalize, but don't over-normalize." In particular, do not normalize datetimes or floats or other "continuous" values.

source

Sure purists say normalize time. That is a big mistake. Generally, "continuous" values should not be normalized because you generally want to do range queries on them. If it is normalized, performance will be orders of magnitude worse.

Normalization has several purposes; they don't really apply here:

  • Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings

  • To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.

  • In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still sucks.

source

Do not normalize "continuous" values -- dates, floats, etc -- especially if you will do range queries.

source

I tried to understand this point but couldn't. Can someone please explain it to me and give an example of a worst case where applying this rule would improve performance?

Note: I could have asked him in a comment or something, but I wanted to document and highlight this point on its own, because I believe it is a very important note that affects almost my entire database's performance.

Accountant م
  • Can you cite a credible textbook that defines database normalization as replacing text with ID numbers? (No, you can't.) And normalization never introduces new attributes. – Mike Sherrill 'Cat Recall' Apr 10 '18 at 20:39
  • @MikeSherrill'CatRecall' thank you for your feedback. I think the example in the question represents the [Third normal form](https://en.wikipedia.org/wiki/Third_normal_form), and which new attributes do you mean? – Accountant م Apr 10 '18 at 21:52
  • The attribute `schools.id` is new. – Mike Sherrill 'Cat Recall' Apr 10 '18 at 23:50
  • Hi. You need to read a textbook. Dozens are free online, also academic slides & courses. Your characterization of normalization is neither correct nor helpful. What do your questions about those posts have to do with normalization (which does not introduce new columns)? Please make your post self-contained, a question should not rely on a link, quote what is relevant & explain how it is. How are we supposed to know what your question is even if we read the posts (which we should not have to do)? Edit your question to clearly ask whatever specific question you are trying to ask. – philipxy Apr 11 '18 at 00:33
  • PS RickJames does not seem to know what "normalization" is. They seem to think it has something to do with replacing a value by an id. They also seem to mean replacing a value by values that combine to it, which does have something to do with some notions of "1NF", but not normalization to higher NFs. Anyway 'Do not normalize "continuous" values' is too short to mean anything, it's not even clear what *continuous* means, and the phrase admits it's not clear via scare quotes; you need to google their posts more and/or leave a comment to them on a SO post to explain what they mean. – philipxy Apr 11 '18 at 00:58
  • @MikeSherrill'CatRecall' It's just the ID. Anyway, I removed the part about my understanding of normalization and kept the question focused on what I want to understand. – Accountant م Apr 11 '18 at 08:36
  • *"It's just the ID . . ."* It's not *just* the ID. It's a huge, red flag that says "I don't understand database normalization." Normalization *never* introduces new attributes. There's no such thing as "over-normalization"; there are just the various normal forms 1NF, 2NF, BCNF, etc. The claim *"Sure purists say normalize time"*, which I presume means to replace a timestamp with a surrogate integer, is just nonsense. Surrogate id numbers have nothing to do with normalization. – Mike Sherrill 'Cat Recall' Apr 11 '18 at 11:06
  • *RickJames does not seem to know what "normalization" is* and *'Do not normalize "continuous" values' is too short to mean anything, it's not even clear what* continuous *means, and the phrase admits it's not clear via scare quotes*. Why still ask what the quote means? I could *guess* at the misconceptions that, combined with unclear writing, led to that quote. That tells you my guesses, not what they meant. Comment on RickJames's post to send them here. But don't trust their use of the term "normalization". [Re "1NF" & "normalization".](https://stackoverflow.com/a/40640962/3404097). – philipxy Apr 11 '18 at 17:28
  • OK, "normalization" is the _wrong term_. **What is the right term?** Then we can start the discussion over without being sidetracked by the wrong term. Some DBAs do replace a `DATE` with a surrogate ID; this is likely to cause significant performance issues when a date range is used. – Rick James Apr 27 '18 at 16:18

1 Answer


The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?

Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:

-- single table
SELECT ...
    FROM t
    WHERE x = ...
      AND date BETWEEN ... AND ...;   -- `date` is of datatype DATE/DATETIME/etc

-- extra table
SELECT ...
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = ...
      AND d.date BETWEEN ... AND ...;  -- Range test is now in the other table

Moving the range test to a JOINed table causes the slowdown.

The first query is quite optimizable via

INDEX(x, date)

In the second query, the Optimizer will (for MySQL, at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth with the other table to handle the rest of the WHERE. (Other engines use other techniques, but there is still a significant cost.)

DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamation that this applies to any "continuous" datatype (ints, dates, floats).

Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index will almost always lead to a more efficient query with the single-table approach.
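To make the contrast concrete, here is a small sketch of the two designs (my illustration, not part of the original answer), using Python's built-in sqlite3 rather than MySQL, so the plans differ in detail but the shape of the problem is the same. All table and column names (`t1`, `t2`, `Dates`, `x`, `d`) are hypothetical:

```python
# Contrast the single-table design (date in the row, composite index)
# with the surrogate-id design (date moved to a separate Dates table).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Single-table design: the date lives in the row itself,
# with the composite index INDEX(x, date) from the answer.
cur.execute("CREATE TABLE t1 (x INTEGER, d TEXT)")
cur.execute("CREATE INDEX t1_x_d ON t1 (x, d)")

# "Normalized" design: the date is replaced by a surrogate date_id
# pointing into a separate Dates table.
cur.execute("CREATE TABLE t2 (x INTEGER, date_id INTEGER)")
cur.execute("CREATE INDEX t2_x ON t2 (x)")
cur.execute("CREATE TABLE Dates (date_id INTEGER PRIMARY KEY, d TEXT)")

days = ["2018-04-%02d" % day for day in range(1, 31)]
cur.executemany("INSERT INTO t1 VALUES (1, ?)", [(d,) for d in days])
cur.executemany("INSERT INTO Dates VALUES (?, ?)", list(enumerate(days)))
cur.executemany("INSERT INTO t2 VALUES (1, ?)", [(i,) for i in range(len(days))])

range_test = "BETWEEN '2018-04-10' AND '2018-04-20'"

# Single table: one index range scan answers the whole query.
q1_sql = "SELECT d FROM t1 WHERE x = 1 AND d " + range_test
q1 = sorted(cur.execute(q1_sql).fetchall())
plan1 = cur.execute("EXPLAIN QUERY PLAN " + q1_sql).fetchall()

# Extra table: the range test now lives in the JOINed table, so the
# engine must bounce between t2 and Dates row by row.
q2_sql = ("SELECT Dates.d FROM t2 JOIN Dates ON t2.date_id = Dates.date_id "
          "WHERE t2.x = 1 AND Dates.d " + range_test)
q2 = sorted(cur.execute(q2_sql).fetchall())
plan2 = cur.execute("EXPLAIN QUERY PLAN " + q2_sql).fetchall()

assert q1 == q2 and len(q1) == 11                 # same answer either way
assert any("t1_x_d" in row[-1] for row in plan1)  # composite index is used
assert len(plan2) > len(plan1)                    # join plan touches two tables
```

Both queries return the same eleven rows, but `EXPLAIN QUERY PLAN` shows the single-table version satisfied by one range scan over the composite index, while the join version must probe two tables.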

Rick James