
Currently I have a table with a column, configuration, that contains comma-separated values (CSVs). I am not sure whether to normalize the table or not. The problem is that this column may contain 50 or more different values. For example, in the table shown below it's 18,20, but other rows in the same column may contain 0, 20, 21, 22, 23, 25, 26, 27, 40, 52, 54, 55 and so on. Within a single row, however, the values are unique; they will never repeat.

I do not know the maximum number of values (it may vary), which is why I kept the column as a CSV. I am currently having trouble normalizing it, or rather I am not sure whether I should normalize it at all. Any help here?

id    tester_type    device_id      board_id        configuration
75946   UFLEX           997           220   
44570   UFLEX           450           220               18,20
44569   UFLEX           449           220               18,20
44568   UFLEX           448           220               18,20
44567   UFLEX           447           220               18

Note: the configuration column also contains empty values or blank spaces.

  • Here is the simple rule: if the value will *ever* be accessed in SQL (e.g. searched, filtered, grouped) *or* it participates in a relation, then it should be normalized; if none of those things is ever done, then (and only then) can it be treated as an *opaque* value. That's the end of the "should I" question, and the answer is almost always: *yes*. – user2864740 Jun 22 '15 at 03:32
  • Well, it's a jaywalking anti-pattern, so you should normalise it ... unless you're absolutely sure that you will never have to query against it. – Ja͢ck Jun 22 '15 at 03:32
  • Now for the 'how to' part: consider a table `(test_id, configuration)` (which is the candidate key itself). – user2864740 Jun 22 '15 at 03:34
  • I see. And to @Ja͢ck: yes, I do have to query against it, so I guess I have to normalize it. Now, back to @user2864740: if I create that new table, should I still keep `configuration` as CSVs? I'm not sure whether that would be suitable. The configurations may vary, like `1,20` or `1,30` or even `1, 30, 50` and so on. If I do create the table, does that mean I have to create a row for every possible configuration value? – kross Jun 22 '15 at 03:37
  • Typical normalisation involves creating an m:n relationship, so a configurations table (with one row for each configuration value) and a join table. – Ja͢ck Jun 22 '15 at 03:41
  • Does this answer your question? [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) – philipxy Feb 02 '22 at 05:59

2 Answers

> I do have to query against it so I guess I have to normalize it.

Yes, you do :)

> If I do create the table, does that mean I have to create a row for every possible configuration value?

An example of a normalised structure would be:

join table
==========
test_id configuration_id (spanning unique constraint)
------- ----------------
44570   18
44570   20
44569   18
44569   20
44568   18
44568   20
44567   18

configurations table
====================
configuration_id
----------------
18
20

If you're using InnoDB, each column of the join table can also be declared as a foreign key to its respective parent table.
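As a concrete sketch, the structure above could be created like this in MySQL (the `tests` parent table and all names here are illustrative, based on the question's columns):

```sql
-- Lookup table: one row per distinct configuration value.
CREATE TABLE configurations (
    configuration_id INT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

-- Join table: one row per (test, configuration) pair.
-- The composite primary key is the "spanning unique constraint" above.
CREATE TABLE test_configurations (
    test_id          INT NOT NULL,
    configuration_id INT NOT NULL,
    PRIMARY KEY (test_id, configuration_id),
    FOREIGN KEY (test_id) REFERENCES tests (id),
    FOREIGN KEY (configuration_id) REFERENCES configurations (configuration_id)
) ENGINE=InnoDB;

-- With this structure, filtering becomes an indexed lookup
-- instead of string parsing against a CSV column:
SELECT test_id FROM test_configurations WHERE configuration_id = 18;
```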

Ja͢ck
  • Hmm, I see, but what if, say, a `test_id` has 40 different configurations? Then there would be 40 more rows for that `test_id`. Is that okay? I'm just afraid there would be too many rows. I'm using the MyISAM engine; not sure if that affects anything. I'm also wondering whether there would be enough storage; this is on a Linux server, not sure if that matters. It's my first time handling a database at this scale, so I just want to make sure; I'm still new to a lot of these things. – kross Jun 22 '15 at 04:00
  • @hzq It really depends on the number of rows we're talking about; it's not a problem unless certain performance benchmarks aren't met, or when you run out of space, or whatever. – Ja͢ck Jun 22 '15 at 04:03
  • @hzq http://stackoverflow.com/questions/2716232/maximum-number-of-records-in-a-mysql-database-table - don't invent problems that [probably] don't exist [for this application]. – user2864740 Jun 22 '15 at 04:23

I disagree with both the "must" and "must not" normalize stances. My 2 cents:

  • Do not normalize "continuous" values such as prices, numbers, dates, floats, etc.
  • Do not normalize values that are unique or nearly so.
  • Do not normalize fields that are narrow. For example, don't replace a 2-letter country code with a 4-byte country_id.

  • "Normalize for simplicity": Do normalize things that are used in multiple tables and are subject to change. Sometimes names, addresses, company names, etc fall into this category. This is so you can change the value in exactly one place, not lots of places.

  • "Normalize for space": Do normalize things that would save a significant amount of overall space for the dataset. (This applies to gigabyte tables much more so than to kilobyte tables.)

  • Normalize, but don't "over-normalize". You will figure out what I mean when you have over-normalized and a nasty JOIN can't be optimized.
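As an illustration of the "narrow field" bullet above, this hypothetical sketch contrasts keeping a 2-letter code inline with normalising it away (all table names are made up for the example):

```sql
-- Narrow natural value stored inline: 2 bytes, no JOIN needed to read it.
CREATE TABLE orders_inline (
    order_id INT NOT NULL PRIMARY KEY,
    country  CHAR(2) NOT NULL        -- e.g. 'US', 'DE'
);

-- Over-normalised alternative: a 4-byte surrogate key that is wider
-- than the value it replaces, and every read of the code needs a JOIN.
CREATE TABLE countries (
    country_id INT NOT NULL PRIMARY KEY,
    code       CHAR(2) NOT NULL
);
CREATE TABLE orders_normalised (
    order_id   INT NOT NULL PRIMARY KEY,
    country_id INT NOT NULL,
    FOREIGN KEY (country_id) REFERENCES countries (country_id)
);
```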

If you would like more specific advice, please show us the `SHOW CREATE TABLE` output and sample values for any non-obvious columns.

Rick James