13

Some while ago, I've been reading through the book SQL and Relational Theory by C. J. Date. The author is well-known for criticising SQL's three-valued logic (3VL).1)

The author makes some strong points about why 3VL should be avoided in SQL, however he doesn't outline how a database model would look like if nullable columns weren't allowed. I've thought on this for a bit and have come up with the following solutions. If I missed other design options, I would like to hear about them!

1) Date's critique of SQL's 3VL has in turn been criticized too: see this paper by Claude Rubinson (includes the original critique by C. J. Date).


Example table:

As an example, take the following table where we have one nullable column (DateOfBirth):

#  +-------------------------------------------+
#  |                   People                  |
#  +------------+--------------+---------------+
#  |  PersonID  |  Name        |  DateOfBirth  |
#  +============+--------------+---------------+
#  |  1         |  Banana Man  |  NULL         |
#  +------------+--------------+---------------+

Option 1: Emulating NULL through a flag and a default value:

Instead of making the column nullable, any default value is specified (e.g. 1900-01-01). An additional BOOLEAN column will specify whether the value in DateOfBirth should simply be ignored or whether it actually contains data.

#  +------------------------------------------------------------------+
#  |                              People'                             |
#  +------------+--------------+----------------------+---------------+
#  |  PersonID  |  Name        |  IsDateOfBirthKnown  |  DateOfBirth  |
#  +============+--------------+----------------------+---------------+
#  |  1         |  Banana Man  |  FALSE               |  1900-01-01   |
#  +------------+--------------+----------------------+---------------+

Option 2: Turning a nullable column into a separate table:

The nullable column is replaced by a new table (DatesOfBirth). If a record doesn't have data for that column, there won't be a record in the new table:

#  +---------------------------+ 1    0..1 +----------------------------+
#  |         People'           | <-------> |         DatesOfBirth       |
#  +------------+--------------+           +------------+---------------+
#  |  PersonID  |  Name        |           |  PersonID  |  DateOfBirth  |
#  +============+--------------+           +============+---------------+
#  |  1         |  Banana Man  |
#  +------------+--------------+

While this seems like the better solution, this would possibly result in many tables that need to be joined for a single query. Since OUTER JOINs won't be allowed (because they would introduce NULL into the result set), all the necessary data could possibly no longer be fetched with just a single query as before.


Question: Are there any other options for eliminating NULL (and if so, what are they)?

stakx - no longer contributing
  • 83,039
  • 20
  • 168
  • 268
  • 1
    could you shortly explain why the three-value-logic should be avoided. The reason I'm aware of is, that you have to store at least one bit more. But if you add another column instead, it doesn't make sense. Also another table leads to query overhead. Another reason I can think of is, you have to handle the NULL value, but you have it with your solutions, too. – TooAngel Jun 20 '10 at 16:07
  • @TooAngel: It's not just about having to store an additional bit. It's about getting query results that don't seem to make (common) sense, e.g. `COUNT(*)` won't count `NULL`, or that `NULL` never equals `NULL` (because basically, `NULL` has the meaning of "unknown"). -- I suggest you read the 5-page paper that I linked to in the footnote. It contains Date's (apparently flawed, but still very insightful) critique of 3VL. Also, you might want to check out the Wikipedia article on ternary logic: http://en.wikipedia.org/wiki/Ternary_logic – stakx - no longer contributing Jun 20 '10 at 16:10
  • To the above, I should add the main point, which is that due to the sometimes "unintuitive" results that you can get thanks to 3VL, results are easily mis-interpreted. Or even worse, a query is not what one thinks it is, and one will get (correct) results that won't seem right. A last point is that `NULL` is often used to mean different things, e.g. "unknown", "missing", "not applicable", etc., which makes it even harder to formulate correct queries and to interpret correctly the result you get from the DB. – stakx - no longer contributing Jun 20 '10 at 16:30
  • Ok, nice, very interesting paper. I would like to provide another option how to overcome the three way logic, but your options are the ones I would also think of. – TooAngel Jun 20 '10 at 16:59
  • @stakx-nolongercontributing - `COUNT(*)` rows with `NULLs`; `COUNT(x)` does not count rows where `x IS NULL`. – Rick James Jul 07 '20 at 05:34

7 Answers7

5

I saw Date's colleague Hugh Darwen discuss this issue in an excellent presentation "How To Handle Missing Information Without Using NULL", which is available on the Third Manifesto website.

His solution is a variant on your second approach. It's sixth normal form, with tables to hold both Date of Birth and identifiers where it is unknown:

#  +-----------------------------+ 1    0..1 +----------------------------+
#  |         People'             | <-------> |         DatesOfBirth       |
#  +------------+----------------+           +------------+---------------+
#  |  PersonID  |  Name          |           |  PersonID  |  DateOfBirth  |
#  +============+----------------+           +============+---------------+
#  |  1         |  Banana Man    |           ! 2          | 20-MAY-1991   |
#  |  2         |  Satsuma Girl  |           +------------+---------------+
#  +------------+----------------+
#                                  1    0..1 +------------+
#                                  <-------> | DobUnknown |
#                                            +------------+
#                                            |  PersonID  |
#                                            +============+
#                                            | 1          |
#                                            +------------+

Selecting from People then requires joining all three tables, including boilerplate to indicate the unknown Dates Of Birth.

Of course, this is somewhat theoretical. The state of SQL these days is still not sufficiently advanced to handle all this. Hugh's presentation covers these shortcomings. One thing he mentions is not entirely correct: some flavours of SQL do support multiple assignment - for instance Oracle's INSERT ALL syntax.

APC
  • 144,005
  • 19
  • 170
  • 281
  • I've read the paper on Third Manifesto and quite like the solution, mainly because it resolves the ambiguity of what exactly `NULL` means. While this separation of different meanings of `NULL` into separate tables improves data quality, I agree with you that actually querying the data will become more difficult. – stakx - no longer contributing Jun 20 '10 at 19:59
  • 1
    +1 for the Darwen article. Minor note: if the relationships are indeed 1:0..1 as indicated then multiple assignment is not an issue. – onedaywhen Sep 07 '11 at 09:00
3

I recommend you go for your option 2. I'm fairly certain Chris Date would too because essentially what you are doing is fully normalizing to 6NF, the highest possible normal form which Date was jointly responsible for introducing. I second the recommended Darwen's paper on handling missing information.

Since OUTER JOINs won't be allowed (because they would introduce NULL into the result set), all the necessary data could possibly no longer be fetched with just a single query as before.

…this is not the case, but I agree the issue of outer join is not explicitly mentioned in the Darwen paper; it was the one thing that left me wanting. The explicit answer may be found in another of Date's book…

First, note that Date and Darwen's own truly relational language Tutorial D has but one join type being the natural join. The justification is that only one join type is actually needed.

The Date book I alluded to is the excellent SQL and Relational Theory: How to Write Accurate SQL Code:

4.6: A Remark on Outer Join: "Relationally speaking, [outer join is] a kind of shotgun marriage: It forces tables into a kind of union—yes, I do mean union, not join—even when the tables in question fail to conform to the usual requirements for union... It does this, in effect, by padding one or both of the tables with nulls before doing the union, thereby making them conform to those usual requirements after all. But there's no reason why that padding shouldn't be done with proper values instead of nulls

Using your example and default value '1900-01-01' as 'padding', the alternative to outer join could look like this:

SELECT p.PersonID, p.Name, b.DateOfBirth
  FROM Person AS p
       INNER JOIN BirthDate AS b
          ON p.PersonID = b.PersonID
UNION
SELECT p.PersonID, p.Name, '1900-01-01' AS DateOfBirth
  FROM Person AS p
 WHERE NOT EXISTS (
                   SELECT * 
                     FROM BirthDate AS b
                    WHERE p.PersonID = b.PersonID
                  );

Darwen's paper proses two explicit tables, say BirthDate and BirthDateKnown, but the SQL would not be much different e.g. a semi join to BirthDateKnown in place of the semi difference to BirthDate above.

Note the above uses JOIN and INNER JOIN only because Standard SQL-92 NATURAL JOIN and UNION CORRESPONDING are not widely implemented in real life SQL products (can't find a citation but IIRC Darwen was largely responsible for the latter two making it into the Standard).

Further note the above syntax looks long-winded only because SQL in general is long-winded. In pure relational algebra it is more like (pseudo code):

Person JOIN BirthDate UNION Person NOT MATCHING BirthDate ADD '1900-01-01' AS DateOfBirth;
onedaywhen
  • 55,269
  • 12
  • 100
  • 138
1

I haven't read it, but there's an article called How To Handle Missing Information Using S-by-C on the the Third Manifesto website that's run by Hugh Darwen and C.J. Date. This isn't written by C.J. Date, but I'd assume that since it's one of the articles on that website it's probably similar to his opinions.

Hans Olsson
  • 54,199
  • 15
  • 94
  • 116
  • There's some interesting information on that website. However, the specific paper that you mention seems more academic in nature: It gives examples in the Tutorial D language and not in SQL, which is what's actually used today. (+1 for the link to the website.) – stakx - no longer contributing Jun 20 '10 at 16:49
  • @stakx: Ah, doesn't surprise me I suppose, I found the website a while ago after reading another question on SO but I found it all a bit too academical to actually read through anything. – Hans Olsson Jun 20 '10 at 16:51
0

One option is to use explicit option types, analogous to Haskell's Maybe functor.

Unfortunately a lot of existing SQL implementations have poor support for user-defined algebraic data types and even poorer support for user-defined type constructors that you really need to do this cleanly.

This recovers a sort of "null" for only those attributes where you explicitly ask for it, but without null's silly three-valued logic. Nothing == Nothing is True, not unknown or null.

Support for user-defined algebraic types also helps when there are a few reasons for missing information, for example a database equivalent of the following Haskell type would be a good solution for the obvious application:

data EmploymentStatus = Employed EmployerID | Unemployed | Unknown

(Of course, a database supporting this would also need to support the more-complicated-than-usual foreign key constraint that comes with it.)

Short of this, I agree with APC's and onedaywhen's answers about 6NF.

Community
  • 1
  • 1
Doug McClean
  • 14,265
  • 6
  • 48
  • 70
  • 1
    SQL `NULL` does not mean "nothing", but "unknown". Therefore it's correct that `NULL = NULL` is not true: If you compare two unknowns, you cannot be sure that they are equal. – stakx - no longer contributing May 31 '13 at 08:23
  • It's "correct" because that's how it's defined, but it is _hugely_ problematic for a number of reasons. Consult http://www.amazon.com/Foundation-Object-Relational-Databases-Manifesto/dp/0201309785 for the gory details which won't fit in this comment. (It's also an old war which doesn't need to be re-debated, but the pro-null side is objectively wrong on the merits.) – Doug McClean May 31 '13 at 13:06
0

One alternative may be the entity-attribute-value model:

 entity  attribute    value
 1       name         Banana Man
 1       birthdate    1968-06-20

If the birthdate was unknown, you'd just omit its row.

Andomar
  • 232,371
  • 49
  • 380
  • 404
  • I guess the problem with this model is that you have to choose a single type for column `value` that is appropriate for all sorts of things (ie. this would require parsing strings or even `BLOB`s on the client side). – stakx - no longer contributing Jun 20 '10 at 16:24
  • @stakx: You could add a `type` column along with `str_value`, `int_value`, `real_value` columns. And I think SQLIte supports EAV natively and store integers, floats and strings in the same column efficiently – Andomar Jun 20 '10 at 16:40
  • I didn't know that SQLite has native support for the EAV model. It's an interesting feature, despite being a non-standard (?) one that other RDBMS might not have. Thanks for the info! – stakx - no longer contributing Jun 20 '10 at 16:47
  • 1
    EAV is at least controversial - check out http://www.simple-talk.com/sql/t-sql-programming/avoiding-the-eav-of-destruction/ or http://www.simple-talk.com/sql/database-administration/five-simple--database-design-errors-you-should-avoid/ for some good reason why you should at least be careful when choosing the EAV approach – marc_s Jun 20 '10 at 17:13
0

Option 3: Onus on the record writer:

CREATE TABLE Person
(
  PersonId int PRIMARY KEY IDENTITY(1,1),
  Name nvarchar(100) NOT NULL,
  DateOfBirth datetime NOT NULL
)

Why contort a model to allow null representation when your goal is to eliminate them?

Amy B
  • 108,202
  • 21
  • 135
  • 185
  • *Every* solution to my question would eventually have `NOT NULL` on all columns (where applicable). But the real question -- if you don't reduce my post to just the last sentence -- was *how* a relational data model would have to be modified **so that it can still accomodate missing or unknown information** for some attributes. Your solution however completely *disallows* missing information for the `DateOfBirth` column and thus misses the most important bit. – stakx - no longer contributing Jun 22 '10 at 20:50
  • If you still permit the third value, you must still write logic about the third value. I don't know why your implementation of the third value would be better than any db vendor's implementation. – Amy B Jun 22 '10 at 23:39
  • 1
    I want to get *rid* of the third value (`UNKNOWN`), *but* at the same time still be able to represent the fact that some information is missing (or unknown, or whatever). This is obviously possible. I am investigating different ways of doing this. You're perhaps saying, "Why not just use `NULL`, it's there for just that purpose?" -- Because the 3VL *can* lead to confusing and unintuitive queries and query results. Even without 3VL, one of course still has to deal *somehow* with possibly missing information -- but doing this with 2VL might prevent some logic errors. – stakx - no longer contributing Jun 23 '10 at 18:28
  • 2
    It's the concept of null that makes the queries confusing and unintuitive, not the mechanism of null. As long as the concept is allowed, you will have to struggle against it. – Amy B Jun 23 '10 at 19:28
0

You can eliminate null in the output as well by using COALESCE.

SELECT personid  /*primary key, will never be null here*/
       , COALESCE(name, 'no name') as name
       , COALESCE(birthdate,'no date') as birthdate
FROM people

Not all databases support COALESCE, but almost all have a fallback option called
IFNULL(arg1, arg2) or something simular that will do the same (but only for 2 arguments).

Johan
  • 74,508
  • 24
  • 191
  • 319