19

I recently noticed an inconsistency in how Postgres handles NULLs in columns with a unique constraint.

Consider a table of people:

create table People (
   pid  int  not null,
   name text not null,
   SSN  text unique,
 primary key (pid)
);

The SSN column should be kept unique. We can check that:

-- Add a row.
insert into People(pid, name, SSN)
values(0, 'Bob', '123');

-- Test the unique constraint.
insert into People(pid, name, SSN)
values(1, 'Carol', '123');

The second insert fails because it violates the unique constraint on SSN. So far, so good. But let's try a NULL:

insert into People(pid, name, SSN)
values(1, 'Carol', null);

That works.

select *    
from People;

0;"Bob";"123"
1;"Carol";"<NULL>"

A unique column will take a null. Interesting. How can Postgres assert that null is in any way unique, or not unique for that matter?

I wonder if I can add two rows with null in a unique column.

insert into People(pid, name, SSN)
values(2, 'Ted', null);

select *    
from People;

0;"Bob";"123"
1;"Carol";"<NULL>"
2;"Ted";"<NULL>"

Yes I can. Now there are two rows with NULL in the SSN column even though SSN is supposed to be unique.

The Postgres documentation says, For the purpose of a unique constraint, null values are not considered equal.

Okay. I can see the point of this. It's a nice subtlety in null-handling: By considering all NULLs in a unique-constrained column to be disjoint, we delay the unique constraint enforcement until there is an actual non-null value on which to base that enforcement.

That's pretty cool. But here's where Postgres loses me. If all NULLs in a unique-constrained column are not equal, as the documentation says, then we should see all of the nulls in a select distinct query.

select distinct SSN
from People;

"<NULL>"
"123"

Nope. There's only a single null there. It seems like Postgres has this wrong. But I wonder: Is there another explanation?


Edit:

The Postgres docs do specify that "Null values are considered equal in this comparison." in the section on SELECT DISTINCT. While I do not understand that notion, I'm glad it's spelled out in the docs.

Alan
  • 3,815
  • 1
  • 26
  • 35

4 Answers4

23

It is almost always a mistake when dealing with null to say:

"nulls behave like so-and-so here, *so they should behave like such-and-such here"

Here is an excellent essay on the subject from a postgres perspective. Briefly summed up by saying nulls are treated differently depending on the context and don't make the mistake of making any assumptions about them.

user14570231
  • 119
  • 4
  • 3
    Thanks for the link to that essay. The author's point that "the best way to think about NULL is as a Frankenstein monster of several philosophies" is amusing if not exactly satisfying. – Alan Apr 29 '11 at 17:13
  • 1
    :( link is live no longer -_- – X-Coder Nov 23 '16 at 08:50
  • Internet Archive mirror for the link : http://web.archive.org/web/20160408042413/http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/ – xtreak Jan 31 '17 at 08:47
12

The bottom line is, PostgreSQL does what it does with nulls because the SQL standard says so.

Nulls are obviously tricky and can be interpreted in multiple ways (unknown value, absent value, etc.), and so when the SQL standard was initially written, the authors had to make some calls at certain places. I'd say time has proved them more or less right, but that doesn't mean that there couldn't be another database language that handles unknown and absent values slightly (or wildly) differently. But PostgreSQL implements SQL, so that's that.

As was already mentioned in a different answer, Jeff Davis has written some good articles and presentations on dealing with nulls.

Peter Eisentraut
  • 35,221
  • 12
  • 85
  • 90
6

NULL is considered to be unique because NULL doesn't represent the absence of a value. A NULL in a column is an unknown value. When you compare two unknowns, you don't know whether or not they are equal because you don't know what they are.

Imagine that you have two boxes marked A and B. If you don't open the boxes and you can't see inside, you never know what the contents are. If you're asked "Are the contents of these two boxes the same?" you can only answer "I don't know".

In this case, PostgreSQL will do the same thing. When asked to compare two NULLs, it says "I don't know." This has a lot to do with the crazy semantics around NULL in SQL databases. The article linked to in the other answer is an excellent starting point to understanding how NULLs behave. Just beware: it varies by vendor.

user14570231
  • 119
  • 4
  • 2
    I disagree slightly on one minor point: To me, null may represent either the absence of a value or that it's currently unknown. I agree with everything else, especially that you cannot compare unknowns because they are... well... unknown. That's why the unique behavior I noted in my question makes sense. What I still don't understand is why nulls, while 'distinct' in unique indexes, are not considered distinct in SELECT statements. – Alan Apr 29 '11 at 17:05
  • @Alan - I'd like to suggest that the 'null' value represents whatever the DBA wants it to represent, and that sensible DBAs use nulls to represent *one thing only*, which might be some kind of 'unknown' or 'absence' but whatever it is, the DBA knows exactly what it means –  Apr 29 '11 at 17:29
  • 1
    @JackPDouglas - Interesting... I never thought of NULLs meaning what the DBA wants. NULLs, according to the great (database) maker himself, E.F. Codd, represent "missing information". [See "Extending the database relational model to capture more meaning" at http://portal.acm.org/citation.cfm?id=320109] C.J. Date points out that this leads us to three-value logic: true, false, and unknown. Date then gets very upset about this and suggests not using NULLs at all and sticking to default values. Whatever the case, fascinating stuff. – Alan Apr 29 '11 at 18:01
  • @Alan - one problem is that there is more than one kind of 'missing information' :) I personally think "not using nulls at all" is excessive - they can be very useful if used sensibly –  Apr 29 '11 at 18:08
  • @JackPDouglas - Oh, I agree. NULLs are very useful if we're careful about it. Date gets curmudgeonly at times. – Alan Apr 29 '11 at 18:12
4

Multiple NULL values in a unique index are okay because x = NULL is false for all x and, in particular, when x is itself NULL. You'll also run into this behavior in WHERE clauses where you have to say WHERE x IS NULL and WHERE x IS NOT NULL rather than WHERE x = NULL and WHERE x <> NULL.

mu is too short
  • 426,620
  • 70
  • 833
  • 800
  • I agree that multiple null values (such as they are) are okay in indexes. I object to them not being distinct. (Actually, equality is not defined for null because null is unknown. So "x = null" is not false, it's undefined. That's why there's an "is" operator, so you can -- as you point out -- test for null in WHERE clauses.) – Alan Apr 29 '11 at 17:01
  • @Alan: NULL can't be compared to any value, or another NULL. That is why you can have two unique NULLs. Your objection to distinctness comes from a desire to treat NULLs as special values, which they are not. – jmz May 01 '11 at 10:21