Database architecture, majority null fields in a column

Question

I have a database table in MySQL, according to a new feature, we can implement in two ways - 1. Either make a new column (nullable) in the same table itself, the con of this approach is - this column will have 95-98% of the times NULL entry. 2. Make a new table with the foreign key of the already existing table.

so the two architecture will look something like this -

1. table1 - <id, ..., new_column>

2. table1 - <id, ...>, table2 - <id, table1_id, ...>

The first approach follows a denormalized approach, while the second one follows a normalized one. But since this is a real-world problem, it is okay to follow the denormalized approach sometimes.

I might be wrong in some of my assumptions of DB design, what do you think is a better approach to solve such kind of problems?

You're really asking several things here. [See here](https://stackoverflow.com/questions/229179/null-in-mysql-performance-storage) for a discussion on null storage in MySQL. For InnoDB, there may not be any storage penalty at all for storing nulls. As to your table design, you should probably be normalizing your data. I would doubt that a normalized approach would end up paying a bigger storage penalty than an unnormalized approach. — Tim Biegeleisen, Aug 22 '18 at 09:01
Your two approaches are not equivalent unless `table1_id` is unique and not null, so basically you could (and should) use it as the primary key of table2 instead of `id`. And in that case, both approaches are correct *and normalized* implementations of a 1:1 relation! It comes down to personal preference which version you choose. — Solarflare, Aug 22 '18 at 09:33
Thanks for the reply, the thing that I am concerned with - cost of denormalized approach vs join of 1:1 normalized relation between two tables. what do you suggest is better. — Sajal Sharma, Aug 22 '18 at 10:03
As I said: this is not a denormalization per sé. The only real costs would be the cost of a (left) join compared to the cost of storage space, and the cost of "I cannot stand to look at scenario 1/2" (aka personal preference or corporate guidelines). You may also consider if that feature is generally optional or, if you reach e.g. twice the current user count, it will get to 50% occupancy eventually, at which point a 1:1 table with a single column and 50% rows might look as weird as a 2% column looks now. — Solarflare, Aug 22 '18 at 10:46

score 1 · Answer 1 · answered Aug 22 '18 at 10:19

It would be really helpful if you could provide specific examples - "should I add a column that may be null" isn't easy to answer.

In very general terms, normalize until you can prove you have to do something else. Design your database for legibility and bug-resistance; adding an extra table is much less effort than working out why on earth your application suddenly reports incorrect data in 12 months when you change a bit of code that accidentally forgets about your denormalization.

So, is this nullable column an attribute of the entity? Not all people have a middle name attribute - perfectly reasonable to have a nullable column. Or is it something that you're just attaching to the entity because it's convenient, but isn't really an attribute?

For instance, a person may have an employer, and that employer may have an address; ideally, you'd create an employer table, with an address attribute; attaching employer_address to person might feel like a shortcut (I don't care about anything other than the address - I never need to know how many people work for that employer).

This may feel like you're saving yourself some effort - but it's less legible (so future developers will wonder why you did this), more bug prone (you may get incorrect or inconsistent addresses for a single employer), and harder to change in the future (good luck working out how many people work for a given employer just based on the address).

score 0 · Answer 2 · answered Aug 28 '18 at 02:15

"Vertical Partitioning" can be advantageous in these cases

The column(s) in the second table are usually missing, so that table has fewer rows. Note: you can get NULL by using LEFT JOIN.
The column(s) in the second table are bulky, but rarely used. There are performance disadvantages when doing SELECT * and some of the columns are TEXT/BLOB. The Vertical Partitioning may help you with speed. (Picking an appropriate ROW_FORMAT in InnoDB virtually eliminates this advantage.)
The most common queries do not need the column(s) of the second table.
You gotta add the column with no downtime. An ALTER .. ADD COLUMN .. on the main table, depending on the MySQL/MariaDB version, may block usage of it for a long time.

I suspect that only 1 table in 100 should be split this way. It is confusing to readers, etc. The benefits I list above are rare, and the benefits may not justify the effort.

The second table would have the same PRIMARY KEY as the main table, but without AUTO_INCREMENT. The two tables would not have the same secondary keys. And note that you cannot have a composite index with columns from both tables.

If the new column(s) are a bunch of "attributes" such as in a 'store' app, consider throwing them in a JSON column. This is open ended, but clumsy to use with WHERE or ORDER BY.

Database architecture, majority null fields in a column

2 Answers2