Is there a general rule of thumb for designing databases when choosing between allowing nullable columns and normalizing to 3NF? I have a table with a column that will mostly consist of nulls (85%), but the table is not over 10K records (not very big). It's mostly for logging and record keeping, so most transactions will be inserts and selects, with no updates. I'm trying to consider both performance and simplicity of design. Will there be a great benefit to either the denormalized or the normalized approach in this case? Do different RDBMSs behave differently?
[What to do with null values when modeling and normalizing?](https://stackoverflow.com/a/40733625/3404097) – philipxy Jun 26 '22 at 10:58
11 Answers
There are three separate reasons to avoid NULLs. In order of importance (in my opinion, of course) they are:
You want to correctly model your data.
You want to provide a database that makes it easy to produce correct applications.
You want to save disk space.
You're concerned about performance.
Okay, there are four separate reasons!
Of the four, I'm very concerned about #1, pretty concerned about #2 (especially since I'm always the programmer, too) and I really don't care much about #3 both because disk space is cheap and also because #1 and #2 get rid of most NULLs in any database for me. As for #4, I never want to sacrifice correctness for performance — it doesn't matter how fast I get the answer if I can't be confident that it's correct. If I can't meet my performance goals in a well-modeled SQL database then an SQL database might be the wrong tool for the job.
So the most important question to me would be "does a single-table approach correctly model the data you want to store?" Also important is "will the single table approach make it too easy to write incorrect code against the database?" I want to force myself (or any programmer who comes after me) to have to put data in the correct places. Look at your data and answer those questions, and it should help you decide on the correct structure.

I like your approach to this topic, but I'm more focused on database development and have to work with other programmers who code the project while I take care of the back end. This includes optimization as well as efficient design, so in my position I feel that numbers 4 and 3 are more of a priority. Thank you for your input. – suhprano Jan 27 '11 at 18:51
Nulls are not safe, so there is no point trying to make them safe or justifying them or re-defining them as "safe".
Self-Contradiction
When you state things like "If I should normalize and split into 2 or 3 others just to avoid a bunch of nulls in my table, or if I should keep one table and leave the nulls in to simplify my code and my design, and avoid extra joins. I was trying to be generic to see what's a good standard so we can apply it to various scenarios", you are working at cross-purposes with yourself on several different points, so no one can help you reasonably. The first thing to do is resolve your cross-purposes.
Either you want Standards (Normalisation, no Nulls, many small fast tables, ease of use and of extension), or you want simple code (one fat table, Nulls, poor performance, impossible to extend). Either you want generic standards, or you want the shortest code blocks.
Justification
Now, being a human being, like the millions of coders before you, whatever you choose, you will justify. Just look at the conflicting and conflicted answers. They all make their choices, and then justify them.
One Standard Technical Answer
But you have asked a technical question, about a well-known subject that has been answered by the giants of the industry over 30 years ago. Standards bodies have accepted these principles as Standards. There is only one technical answer. The other answers are justifications for non-technical and sub-standard methods.
Normalise. Not just to avoid a bunch of nulls in my table, but because if it isn't Normalised, it isn't a Database, it is a flat file.
Normalise because it gets rid of data duplication.
Normalise because Normalised Databases are much, much faster than flat files.
It is a matter of simple physics. Normalised rows are much shorter; therefore many more rows fit into the same disk block, or page, and therefore many more rows fit into any given memory (cache). It should be no surprise that that will lead to faster processing and less I/O overall, for all users of the server.
Normalise because the resulting database will be much smaller (more, smaller tables, but smaller overall)
And finally Normalised data will not have Nulls.
Nulls mean one of two things.
Either "optional" fields (well, they can't be columns, because it is not a database), which means that data is not normalised.
Or "missing/unknown value" which means you have no Data Integrity (again, a flat file, not a database); the data cannot be relied upon for analysis.
Sure, SQL is cumbersome with joins, but SQL is all we have, so deal with it. That means, learn to code joins, use cut-and-paste.
"Join Cost"
SQL was designed for Relational databases, not for flat files. That means many small tables, rather than fewer large tables. Joins are pedestrian for Relational databases; there is no point in "avoiding joins". There is a myth that "joins cost", but thus far no one has provided any evidence. All the SQL vendors have been enhancing their engines for over 25 years, hundreds of man-years by serious engineers, to ensure that joins cost nothing.
Now do not mix things up, do not misconstrue what I am saying:
The cost is in the size of the data sets being joined; whether indices can be used; the nature of the join; whether there is a DataType mismatch; the search arguments; etc. But the code itself required for joins (assuming we are joining on Keys), the "join cost", is nothing. Just check your statistics and query plans.
And do not make your estimations based on your knowledge, which is, as evidenced, limited to joining fat flat files; to be sure, joining those monsters costs an awful lot, for exactly the reasons listed above.
SQL and Non-SQL
You have tagged your question "SQL" and "MySQL". SQL is a Standard published by IEC/ISO/ANSI. MySQL is not SQL. The treatment of Null is set out in the Standard. What MySQL does is non-standard, in both engines. In fact, what it did last year and what it will do this year are different, and non-standard.
To call a Non-SQL, "SQL", when SQL is a Standard, is a simple fraud. Just like calling a bunch of flat files a "database".
Point is, you will get one answer if your question was tagged "SQL" and a different answer if it was tagged "MySQL".
Un-normalised for Coder Convenience
The main reason coders should not be allowed to design "databases" is demonstrated beautifully in this thread. They have no concerns about performance or ease of use to others. If we left it to them, they would design flat files full of nulls, to "simplify" their code, and actually justify it.

You have some knowledge here, but you also assume my limitations and patronize my approach to figuring out performance vs simplicity. – suhprano Jan 27 '11 at 17:48
You have some knowledge here, but you're also quick to point out my limitations and patronize my attempt when I'm simply asking the community about their experiences to learn a little more about performance vs simplicity. You assume this role as some know-it-all with a grandiose tone to ridicule my OP, and even though your post may be informative, I do not appreciate your efforts. One more thing, I tagged MySQL because it was relevant to my RDBMS question. Nowhere did I question standards or make any claims of such. I'm sure you had your heart in the right place, but it was poorly executed. – suhprano Jan 27 '11 at 18:01
@Adam: +1, but the pity of the thing is that the guy isn't a troll. He actually has some valuable insight that could benefit people and puts an awful lot of effort into his answers. It's a shame he's so unpleasant (and yes, he's been very unpleasant to me personally). – Larry Lustig Jan 27 '11 at 18:19
@suhprano. 1) No offence intended. I quoted your statements. 2) Would you like me to post evidence? 3) If you look closely at the comments & answers, you will notice the community has personal non-technical opinions (posted as facts), not technical facts; they uphold the myths. Anyone posting technical facts always gets flak from the "technical" "community". I am damned if I join the non-technical community, and I am damned if I follow the FAQ and counter the misinformation on SO. – PerformanceDBA Jan 27 '11 at 21:49
@Larry. 1) I thought I had voted you up; no unpleasantness, no personalities (that is not to say the many here are personally involved). I follow the FAQ and correct incorrect technical info when I see it. If you point out the "unpleasantness" I will correct it. Thanks. 2) Middle names. People are allowed to not have a middle name. Have you heard of the empty string? According to the Standard, it is different to Null. – PerformanceDBA Jan 27 '11 at 21:54
1) You quote them out of context to assemble your own undermining conclusions. 2) Evidence is welcome, but I'm asking for generality and various approaches on the matter. There isn't just one right answer. 3) People comment from personal work experience. I openly support discussions, but you refute it by acting like the db forum nazi, debunking what everyone else says and preaching only your so-called knowledge. I know it's a tech forum, but as a human being you're obligated to be civil and have respect for others' opinions. The community will then decide to agree or disagree. – suhprano Jan 27 '11 at 22:28
@DBA: Apologies if that was over the line. My response was for @suhprano, as an indication that you do not generally respond well to anyone critiquing your method of interaction with people on this site. You're quite abrasive and insulting, and your response to this, rather than attempting to adopt a sliver of common courtesy (something also mentioned in the FAQ, incidentally), is to attack. If you would put half the effort into being civil as you do into your responses, you would be an asset to the community. As it stands now, your expertise gets lost in your grandstanding and condescension. – Adam Robinson Jan 27 '11 at 22:28
@suhprano. 1) I post technical facts, not personal conclusions. Which bit do you need more evidence for? 2) In technical matters, there is one right answer, with technical facts to support it. Agreement does not matter. However, if you ask a group of non-technical people, they will answer from experience and personal feelings. That is why, in a popularity-based system, many wrong answers are popular, and many right answers are not. Technical relevance is set aside. That is not my idea of a "technical" website, where everyone is encouraged to be technically correct (as per the FAQ). – PerformanceDBA Jan 27 '11 at 23:47
@Larry. I can't find it, in order to apologise, I need to know what I am apologising for. Was that the question where you commented "I'm not sure what the question is" ? – PerformanceDBA Jan 29 '11 at 00:25
@Larry. Is [this question](http://stackoverflow.com/questions/4261057/table-with-2-fields-but-only-one-can-be-used-at-a-time/4275227#4275227) the one you mean? You (clearly, now) misunderstood my answer and made generalised comments, which were false, and could not specify anything wrong when requested. You had problems reading specific technical information. You did not have the courtesy to retract your generalised statements. If anything, you should be apologising to me! (I am not holding my breath.) – PerformanceDBA Jan 31 '11 at 06:55
Are you talking about tables allowing users to store NULL as opposed to an empty string, or a number?
If so, I would say that you should only allow NULL values in a column if NULL has a special meaning (not just equivalent to an empty string '' or 0).
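To illustrate the distinction this answer draws, here is a minimal sketch (using Python's built-in sqlite3; the table and column names are illustrative, not from the question) showing that NULL and the empty string are separate values, so NULL should be reserved for cases where "no value" carries its own meaning:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, middle_name TEXT)")
conn.execute("INSERT INTO person VALUES ('Ann', '')")    # known to have no middle name
conn.execute("INSERT INTO person VALUES ('Bob', NULL)")  # middle name unknown

# '' matches equality comparisons; NULL does not.
no_middle = conn.execute(
    "SELECT name FROM person WHERE middle_name = ''").fetchall()
unknown = conn.execute(
    "SELECT name FROM person WHERE middle_name IS NULL").fetchall()
print(no_middle)  # [('Ann',)]
print(unknown)    # [('Bob',)]
```

If NULL would only ever mean "same as empty string" in a column, the answer's advice is to disallow it and store '' instead.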

Logging tables should be denormalized as a general rule, since you are interested in the state of things at the time of the event and you don't want to just hold onto a foreign key to anything as it may have changed since then.
However, I don't see what NULL has to do with normalization here?
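The denormalized-logging point above can be sketched as follows (sqlite3 via Python; all table and column names are hypothetical): the log copies the values of interest at event time instead of holding only a foreign key whose target row may change later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("""CREATE TABLE price_log (
    product_id   INTEGER,   -- kept for traceability, not joined for history
    product_name TEXT,      -- snapshot of the name at event time
    price        REAL,      -- snapshot of the price at event time
    logged_at    TEXT DEFAULT CURRENT_TIMESTAMP)""")

conn.execute("INSERT INTO product VALUES (1, 'Widget', 9.99)")
conn.execute("""INSERT INTO price_log (product_id, product_name, price)
                SELECT id, name, price FROM product WHERE id = 1""")

# The product row changes later, but the log still shows the old state.
conn.execute("UPDATE product SET price = 12.50 WHERE id = 1")
row = conn.execute("SELECT product_name, price FROM price_log").fetchone()
print(row)  # ('Widget', 9.99)
```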

NULL means 'missing or unknown'. It's got little to do with normalization; it's got to do with the domain of the attribute. If the attribute is required, it is non-null (e.g., EmployeeName is probably a required attribute for the Employee entity). If the attribute is optional or otherwise non-required (e.g., not all Employees have — or are willing to share with their boss — a home email address), then the column EmployeeHomeEmailAddress should be nullable.
Whether that column itself is properly a part of the table is dependent on its semantics. A nullable column does not necessarily (though it might) indicate a design issue — perhaps it should be its own entity in the model, with a 0:1 or 0:M cardinality to the parent entity. Perhaps it shouldn't. It all depends on the semantics of the entity and the attribute. But you can't just make a blanket assertion that nullable column == denormalized data: that way leads to madness.
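The two designs this answer weighs can be sketched side by side (sqlite3 via Python; the employee/home-email names are illustrative): a nullable column on the parent versus the attribute as its own entity with 0:1 cardinality.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Design A: the optional attribute as a nullable column on the parent table.
conn.execute("""CREATE TABLE employee_a (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    home_email TEXT)""")  # NULL when not provided

# Design B: the optional attribute as its own entity, 0:1 to the parent;
# absence of a row means "no home email". No NULLs in the base tables.
conn.execute("CREATE TABLE employee_b (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE employee_home_email (
    employee_id INTEGER PRIMARY KEY REFERENCES employee_b(id),
    home_email  TEXT NOT NULL)""")

conn.execute("INSERT INTO employee_b VALUES (1, 'Ann')")
conn.execute("INSERT INTO employee_b VALUES (2, 'Bob')")
conn.execute("INSERT INTO employee_home_email VALUES (1, 'ann@example.com')")

# Reading Design B back requires an outer join, which reintroduces NULL
# (None) in the result set even though the base tables contain none.
rows = conn.execute("""SELECT e.name, h.home_email
                       FROM employee_b e
                       LEFT JOIN employee_home_email h ON h.employee_id = e.id
                       ORDER BY e.id""").fetchall()
print(rows)  # [('Ann', 'ann@example.com'), ('Bob', None)]
```

Which design is right depends, as the answer says, on the semantics of the entity and the attribute.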

Nulls have everything to do with normalization: 1st normal form requires that nullable columns cannot exist. All other normal forms build upon first normal form and therefore inherit this requirement, meaning that no normalized data structure can have a null value. That's not to say that nulls are somehow "bad" in practice, but when it comes to theoretical normalization, they -- technically -- are. This nullability denial in normal forms is controversial, but established. – Randolpho Jan 26 '11 at 21:26
I was comparing nullable columns with 3NF, where it says that no table designs should allow nulls... So it does have some impact on normalization. I wasn't making a blanket assertion that nullable column == denormalized data, but that nullable columns are not in 3NF. – suhprano Jan 26 '11 at 21:33
How does 1NF require non-nullable attributes? 1NF gets rid of repeating groups and non-atomic attributes. I understand that Chris Date doesn't like nullability, but he doesn't live in the real world. Sorry, I'm on Mr Codd's side WRT null (and I agree with him that there should be two distinct null values ('missing/applicable' and 'missing/inapplicable') — they represent the conceptual difference between not answering a question and answering 'I don't know' or 'other'). – Nicholas Carey Jan 26 '11 at 21:56
@Nicholas Carey: Date is technically correct, but that doesn't mean we need to be slaves to non-nullability. And Codd just wants normalization and practicality to mean the same thing, but *they don't have to*. Normalize till it hurts, de-normalize till it works. – Randolpho Jan 26 '11 at 22:59
@Randolpho. I agree with where you are coming from, but what you state re "normalisation" and "NFs" is simply not correct. When you start making statements about Date being this or that, you damage your own credibility. – PerformanceDBA Jan 27 '11 at 05:55
@PerformanceDBA: you don't have to agree with somebody in general to acknowledge the factual veracity of one or more of that person's statements. I'd argue that your belief that my acknowledgement of Date's veracity somehow damages my credibility actually damages your own. – Randolpho Jan 27 '11 at 15:15
@Randolpho. That's how arguments get tangential, and continue. The point is, you are in no position to have opinions about Date, or about the applicability of his principles. Post your qualifications. Second, your statements re "normalisation" and "NFs" are simply incorrect. I would appreciate it if you answered those points, rather than getting caught up in tangential issues. Nicholas has a technically correct understanding of the problem, and the definitions; you do not; your countering him is incorrect. I am following the FAQ and correcting incorrect information. – PerformanceDBA Jan 27 '11 at 22:05
Bottom line: if you want your data structure to truly be normalized to 3rd Normal Form, you must not have null columns.
If you live and work in the real world like the rest of us, you'll find nullable columns perfectly acceptable and "safe". Your database won't technically be normalized, but what data system truly is?
Very few.

In the "real world" it is usually important that you get the right answers and that the database accurately models reality. With nulls you get some answers which are wrong and which contradict reality. That is why nulls are unsafe and that is the perfectly sensible and prudent reason to avoid them. – nvogel Jan 27 '11 at 10:17
@dportas: I agree that avoiding nulls -- *in general* -- is a good idea, but very frequently the data model must support the concepts of null because "reality" requires them; reality being defined here as "the business model". Sometimes optional fields need to be allowed, and it's not always a good idea to stick those values in a separate table just to satisfy a "no nulls" rule. I've said it before: normalize till it hurts, denormalize till it works. – Randolpho Jan 27 '11 at 15:13
@Randolpho: No reality ever *requires* nulls. Nulls are just an implementation choice: the database designer chooses whether to use them or not (as I think you are implicitly acknowledging in your reply). A responsible database designer ought to acknowledge and properly understand the implications of making that choice and not try to shift the blame onto the business model. – nvogel Jan 27 '11 at 19:44
@dportas: absolutely, a database designer should definitely acknowledge and understand the implications of that choice. But I think the blame should definitely be pushed to the business model, since the *first* choice should be "no nulls" -- unless requirements dictate otherwise. – Randolpho Jan 27 '11 at 20:14
@Randolpho: In 20 years of developing database applications I never yet encountered a case where the business model "required" a null. There are no nulls in reality - null is just a symbol that can optionally be used to represent some information in a database. All information can be represented accurately without nulls however, so nulls are purely a technical decision that has nothing to do with business requirements. – nvogel Jan 27 '11 at 20:30
@Larry. dportas probably does what most experienced Data Modellers do. He probably understands the difference between Null and the empty string. – PerformanceDBA Jan 29 '11 at 00:21
How does an empty string help? That can handle the small percentage of people who have no middle name, and tell you so. But most people do have middle names and you generally don't learn them at the beginning of a business relationship. No middle name and middle name not known are two formally different situations. You still need to (correctly) represent the second situation. – Larry Lustig Jan 29 '11 at 01:49
@Larry. You're jumping from context to context to suit your purpose. Read the question again and stick to the context. If you need to know about the analysis stage of a project, ask a new question. If you need to know how analysts treat or model "unknown" vs "missing" vs "optional" vs "mandatory", ask that question. This one is about Nulls. The empty string may not be useful to you, but it may be useful to other people; it may have some purpose, after all, it is defined in the standard. – PerformanceDBA Jan 29 '11 at 10:33
Nulls are "safe" when you don't always need an input. In fact, they may be preferred if there is no real default value. For example, say you're tracking inventory and the columns are id, length, width, height, weight.
Maybe you don't get a measurement. The width shouldn't be 0; it should be null. This is particularly relevant if 0 is a valid width or value for what you are tracking.
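A small demonstration of why this matters (sqlite3 via Python; column names are illustrative): aggregates skip NULLs, so "not measured" rows don't corrupt the numbers the way a 0 placeholder would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, width REAL)")
conn.executemany("INSERT INTO inventory (width) VALUES (?)",
                 [(10.0,), (20.0,), (None,)])  # third item: width unknown

# AVG() ignores NULLs, so the unmeasured row doesn't skew the result.
avg_width = conn.execute("SELECT AVG(width) FROM inventory").fetchone()[0]
print(avg_width)  # 15.0 (a 0 placeholder would have given 10.0)
```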

SQL Server 2008 has sparse columns and filtered indexes to help address this situation.
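SQLite offers an analogous feature to SQL Server's filtered indexes — partial indexes — which can also be used to index only the minority of rows where a mostly-NULL column has a value. A sketch (table and index names are illustrative), reproducing the question's 85%-NULL scenario:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, detail TEXT)")

# Index only the non-NULL rows (~15% of the table in the question's scenario).
conn.execute("""CREATE INDEX ix_log_detail ON log (detail)
                WHERE detail IS NOT NULL""")

conn.executemany("INSERT INTO log (detail) VALUES (?)",
                 [(None,)] * 85 + [("error",)] * 15)

# A query whose WHERE clause implies the index's WHERE clause can use it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM log WHERE detail = 'error'").fetchall()
uses_index = any("ix_log_detail" in row[3] for row in plan)
print(uses_index)
```

This keeps the index small without splitting the column into a separate table, at the cost of tying the design to a vendor feature.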

If I understand the question, you are referring to the normalisation of data that does not apply to every record in one table. Say you have a vehicle table in your database that looks like so:
vehicle
----------------------------
vehicleId int
makeId int not null references make(makeId)
modelId int not null references model(modelId)
numberOfWheels tinyint null
hullMaterialId int null references hullMaterial(hullMaterialId)
Then you could normalise out some of these null columns into "1 to zero or 1" tables to make better sense of the data:
roadvehicle
----------------------------
vehicleId int references vehicle(vehicleId)
numberOfWheels tinyint not null
waterVehicle
----------------------------
vehicleId int references vehicle(vehicleId)
hullMaterialId int not null references hullMaterial(hullMaterialId)
(Please pardon the much simplified example)
I believe you are considering the right things. To me, if it makes sense from a design and a domain point of view, then it should be done, but I'm not aware of any rule of thumb for this; it comes down to experience.
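The split sketched above can be made runnable, for example with sqlite3 via Python (the make/model reference tables are omitted to keep the sketch self-contained): the mostly-NULL columns move into "1 to zero-or-1" subtype tables where they can be declared NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle (
    vehicleId INTEGER PRIMARY KEY,
    makeId    INTEGER NOT NULL,
    modelId   INTEGER NOT NULL);
CREATE TABLE roadVehicle (
    vehicleId      INTEGER PRIMARY KEY REFERENCES vehicle(vehicleId),
    numberOfWheels INTEGER NOT NULL);
CREATE TABLE waterVehicle (
    vehicleId      INTEGER PRIMARY KEY REFERENCES vehicle(vehicleId),
    hullMaterialId INTEGER NOT NULL);

INSERT INTO vehicle VALUES (1, 10, 100);  -- a car
INSERT INTO vehicle VALUES (2, 20, 200);  -- a boat
INSERT INTO roadVehicle VALUES (1, 4);
INSERT INTO waterVehicle VALUES (2, 7);
""")

# Each subtype query sees only applicable, non-NULL attributes.
wheels = conn.execute("""SELECT v.vehicleId, r.numberOfWheels
                         FROM vehicle v JOIN roadVehicle r
                           ON r.vehicleId = v.vehicleId""").fetchall()
print(wheels)  # [(1, 4)]
```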

That's basically what I was looking for... If I should normalize and split into 2 or 3 others just to avoid a bunch of nulls in my table, or if I should keep one table and leave the nulls in to simplify my code and my design, and avoid extra joins. I was trying to be generic to see what's a good standard so we can apply it to various scenarios. – suhprano Jan 26 '11 at 19:55
I did work somewhere where the table design for describing criminal charges involved having a giant table with at least 100 columns including vehicle licence plate and dog colour. It could've benefitted significantly from normalisation. – Mike Meyers Jan 26 '11 at 23:28
Some things I have learned with respect to NULL values -- trying to avoid a duplicate answer :-)
Additionally, relevant information is also presented in the Wikipedia article NULL: Controversy (and surrounding context).
Not all RDBMS systems behave the same.
In SQL Server, NULL is considered a unique-ing value in a UNIQUE column/index. In SQLite, a NULL may appear many times and does not count as a duplicate value. Apparently the SQL92 specification is ambiguous in this regard.
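The SQLite side of this difference is easy to demonstrate (the table name is illustrative): repeated NULLs pass a UNIQUE constraint, while repeated real values do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (code TEXT UNIQUE)")
conn.execute("INSERT INTO t VALUES (NULL)")
conn.execute("INSERT INTO t VALUES (NULL)")  # accepted: NULLs don't collide
n = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(n)  # 2

try:
    conn.execute("INSERT INTO t VALUES ('x')")
    conn.execute("INSERT INTO t VALUES ('x')")  # real duplicate: rejected
except sqlite3.IntegrityError as e:
    print("duplicate rejected:", e)
```

On SQL Server, the second NULL insert would be the one rejected by a plain unique index.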
NULL IS NULL -> true (standard and correct), but NULL = NULL -> ??? It should evaluate to UNKNOWN (which a WHERE clause treats as false), but this depends upon the RDBMS and settings! In SQL Server this can be controlled by the ANSI_NULLS setting. This can make a join misbehave if running under a different context! Depending upon the RDBMS, similar behavior can also apply to other operations. Please always use/support proper SQL equality!
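Three-valued logic in action, under SQLite's default (standard-conforming) behavior: NULL = NULL yields UNKNOWN (surfaced as None), while IS NULL is the correct test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
eq, is_null = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()
print(eq)       # None  (UNKNOWN, not true)
print(is_null)  # 1     (true)

# The practical consequence: WHERE treats UNKNOWN as false, so '='
# silently drops NULL rows that IS NULL would find.
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES (NULL)")
eq_count = conn.execute(
    "SELECT COUNT(*) FROM t WHERE v = NULL").fetchone()[0]       # 0
is_count = conn.execute(
    "SELECT COUNT(*) FROM t WHERE v IS NULL").fetchone()[0]      # 1
print(eq_count, is_count)
```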
Different RDBMS systems may also use a different physical table layout, and when a column is NULLABLE it may mean that space is not reserved in the record (this is true for CHAR(100) or XML (the stub bit) in SQL Server, for instance). If such a record needs to be updated after the initial creation (e.g. the column was initially NULL), this can lead to excess fragmentation in certain circumstances. However, this should only be something to note, and I would be hesitant to make general decisions based around such an implementation detail, as there are other factors in play!
Make sure to define what NULL means
Unfortunately there are no NOTSET or INVALID values to complement NULL. Each column should have the particular meaning of NULL defined. Does it mean "invalid value" or "value not specified" or "value not known", etc.? It may require mixing NULL with non-NULL sentinel values (or not using NULL at all, or using additional state columns/relationships) to achieve the correct business logic.
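One way to make the meaning of "no value" explicit, per the point above, is to pair the nullable column with a status column constrained by CHECKs (sketched here with sqlite3 via Python; the column names and status codes are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reading (
    id INTEGER PRIMARY KEY,
    value REAL,
    value_status TEXT NOT NULL
        CHECK (value_status IN ('present', 'not_specified', 'unknown', 'invalid')),
    -- value is non-NULL exactly when the status says it is present
    CHECK ((value_status = 'present') = (value IS NOT NULL)))""")

conn.execute("INSERT INTO reading (value, value_status) VALUES (42.0, 'present')")
conn.execute("INSERT INTO reading (value, value_status) VALUES (NULL, 'unknown')")

# A row whose status contradicts its value is rejected.
try:
    conn.execute("INSERT INTO reading (value, value_status) VALUES (NULL, 'present')")
except sqlite3.IntegrityError:
    print("rejected: NULL must carry an explanatory status")
```

This keeps NULL unambiguous: every NULL in the table is labeled with which kind of "missing" it is.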
Happy SQL'ing.
When designing a table, if you expect 85% of the values in a particular column to be null, it is better to keep that column in another table [based on normalization]. The data block assigned to a table record may be calculated based on the bytes specified for each column [like the malloc function in C]. This can cause row chaining.

Really? I've never heard of the 85% rule [of thumb]. Do you have any references for that? Avoiding row chaining/block splits is not a normalization issue in my book, it's a physical implementation issue. – DCookie Jan 26 '11 at 21:15
@DCookie: I think the 85% came from the original question and isn't necessarily a general rule of thumb. – Mike Meyers Jan 26 '11 at 23:24
DCookie: 85% is from the original question [meaning fewer values]. Using normalization we can save the physical size of tables. If you think of the table specification in bytes, you can realise how much space can be saved using normalization. – San Jan 28 '11 at 13:26