10

Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?


Given a query to find all stackoverflow users who have a post:

SELECT * FROM Users
WHERE UserID IN (SELECT UserID FROM Posts)

This works as expected; i get a list of all users who have a post.

Now query for the inverse; find all stackoverflow users who don't have a post:

SELECT * FROM Users
WHERE UserID NOT IN (SELECT UserID FROM Posts)

This returns no records, which is incorrect.

Given hypothetical data1

Users              Posts
================   ===============================
UserID  Username   PostID   UserID  Subject
------  --------   -------  ------  ----------------
1       atkins     1        1       Welcome to stack ov...
2       joels      2        2       Welcome all!
...     ...        ...      ...
399573  gt6989b    ...      ...
...     ...        ...      ...
                   10592    null    (deleted by nsl&fbi...
                   ...      ... 

And assume the rules of NULLs:

  • NULL = NULL evaluates to unknown
  • NULL <> NULL evaluates to unknown
  • value = NULL evaluates unknown

If we look at the 2nd query, we're interested in finding all rows where the Users.UserID is not found in the Posts.UserID column. i would proceed logically as follows:

Check UserID 1

  • 1 = 1 returns true. So we conclude that this user has some posts, and do not include them in the output list

Now check UserID 2:

  • 2 = 1 returns false, so we keep looking
  • 2 = 2 returns true, so we conclude that this user has some posts, and do not include them in the output list

Now check UserID 399573

  • 399573 = 1 returns false, so we keep looking
  • 399573 = 2 returns false, so we keep looking
  • ...
  • 399573 = null returns unknown, so we keep looking
  • ...

We found no posts by UserID 399573, so we would include him in the output list.

Except SQL Server doesn't do this. If you have a NULL in your in list, then suddenly it finds a match. It suddenly finds a match. Suddenly 399573 = null evaluates to true.

Why does the comparison of value to null return unknown, except when it returns true?

Edit: i know that i can workaround this nonsensical behavior by specifically excluding the nulls:

SELECT * FROM Users
WHERE UserID NOT IN (
   SELECT UserID FROM Posts
   WHERE UserID IS NOT NULL)

But i shouldn't have to, as far as i can tell the boolean logic should be fine without it - hence my question.

Footnotes

  • 1 hypothetical data; if you don't like it: make up your down.
  • celko now has his own tag
Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
  • 3
    Duplicate: See http://stackoverflow.com/questions/129077/sql-not-in-constraint-and-null-values – Keith Oct 13 '10 at 14:30
  • i see. It's not pointed out in the other question - but it's a bug in how the `NOT IN` is expanded. Rather than logically determining if the row is *"not in"* the list, it is expanding it to a series of `and <>` clauses. – Ian Boyd Oct 13 '10 at 14:42
  • 1
    @Ian Boyd = I don't think that's a bug. Something else to remember is `IN` uses `or` and `NOT IN` uses `AND`. When you are evaluating against a `NULL` with an inequality vs. a known value, you will always get a false since there is no way to know if it matches or not. – JNK Oct 13 '10 at 14:49
  • 3
    @Ian Boyd: I wouldn't call this a bug. `IN` is nothing more than a convenient shorthand for a series of `OR` clauses. Technically, the correct way to think of the expansion given in the other question is: `select 'true' where NOT(3 = 1 or 3 = 2 or 3 = null)`, which is logically equivalent by [DeMorgan's Law](http://en.wikipedia.org/wiki/De_Morgan's_laws). In any case, the fallacy is assuming that a comparison of a value=NULL returns FALSE when in fact the result is UNKNOWN. – Joe Stefanelli Oct 13 '10 at 14:52
  • 5
    Yep. Not a bug. Also it is **not** boolean logic. It is [3 valued logic](http://www.simple-talk.com/sql/learn-sql-server/sql-and-the-snare-of-three-valued-logic/). – Martin Smith Oct 13 '10 at 15:17
  • @JNK, Joe Stefanelli: That's the problem. Converting `IN` to `or`, and `NOT IN` to `and`, changes the logical meaning. – Ian Boyd Oct 13 '10 at 18:28
  • @Ian Boyd - No, logically a list of values for an `IN` clause SHOULD evaluate to a series of `OR` statements. The `NOT IN` also logically evaluates to a series of `AND` statements. What else are you asking when you ask if a value is `NOT IN (a,b,c,d)` than to say it is `NOT a, NOT b, NOT c, AND ALSO NOT d`? – JNK Oct 13 '10 at 18:42
  • @JNK *"i'm thinking about naming my daughter Kirsten, but only if none of my friends have a sister named Kirsten.* The names of my friend's sisters are: `Nicholle`, `Joanne`, *i don't have a sister*, `Sherry`. Excellent, `Kirsten` is `not in` the list of names. Some math geek turned it into boolean algebra, gets the wrong answer, then says that the question itself is flawed. That's fine, but ask anyone who's graduated the 8th grade which answer is right. But it doesn't matter: i got the answer to my question: `NOT IN` does not mean `NOT IN`, it means `and <> ... and <> ... and <>`. – Ian Boyd Oct 16 '10 at 15:09
  • Also, look at the algorithm i used in the original question to evaluate `NOT IN` - that is what any reasonable human being would do when asking if `Kirsten` is not in (`Nicholle`, `Joanne`, *i don't have a sister*, `Sherry`). Database engine designers chose to represent `NOT IN` as `and <> and <> and <>`, which is fine. – Ian Boyd Oct 16 '10 at 15:14
  • @Joe Stefanelli "is nothing more than a convenient shorthand" for something else. My confusion was that i thought `not in` meant `not in`, i didn't realize it was a shorthand for a series of `and <>`. Once that was defined: the results i see makes sense. – Ian Boyd Oct 16 '10 at 15:24
  • 2
    @Ian Boyd - For your sister's name issue, that's a data modelling issue. You should have a subtable `SISTER` that refers back to your main table. If you don't have a sister you wouldn't be included in the result set. Also, `NULL` means unknown. If you have 2 people whose names you don't know, can you say one of their names isn't Kirsten? – JNK Oct 16 '10 at 18:21
  • @JNK: You're thinking like a computer scientist; which is the opposite of a regular person. – Ian Boyd Oct 20 '10 at 15:22
  • @Ian Boyd - Your sister example would work the way you expected if you used a non-ANSI null value, such as an emtpy string (as a dummy null value, not actually null)--or turn off ANSI_NULLS, of course. But your example also doesn't anticipate multiple sisters (I'm not sure if the model suggested by @JNK does, either). If you used a row for each sister with a foreign key back to the corresponding friend, I think the corresponding query you would then construct with an INNER JOIN would correctly ignore friends with no sisters and produce the correct list of names with no nulls in it. – Rob Parker Apr 20 '11 at 18:46

2 Answers2

13

Common problem, canned answer:

The behavior of NOT IN clause may be confusing and as such it needs some explanations. Consider the following query:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)

Although there are more than a thousand distinct last names in AdventureWorks.Person.Contact, the query returns nothing. This may look counterintuitive to a beginner database programmer, but it actually makes perfect sense. The explanation consist of several simple steps. First of all, consider the following two queries, which are clearly equivalent:

SELECT LastName, FirstName FROM Person.Contact

WHERE LastName IN('Hedlund', 'Holloway', NULL)



SELECT LastName, FirstName FROM Person.Contact

WHERE LastName='Hedlund' OR LastName='Holloway' OR LastName=NULL

Note that both queries return expected results. Now, let us recall DeMorgan's theorem, which states that:

not (P and Q) = (not P) or (not Q)

not (P or Q) = (not P) and (not Q)

I am cutting and pasting from Wikipedia (http://en.wikipedia.org/wiki/De_Morgan_duality). Applying DeMorgan's theorem to this queries, it follows that these two queries are also equivalent:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)



SELECT LastName, FirstName FROM Person.Contact

WHERE LastName<>'Hedlund' AND LastName<>'Holloway' AND LastName<>NULL

This last LastName<>NULL can never be true

A-K
  • 16,804
  • 8
  • 54
  • 74
  • 1
    DeMorgan's theorem applies to boolean logic, not to 3-valued logic. For example: `not (unknown or true) = false`, but `(not unknown) and (not true) = unknown`, violating the theorem – Andomar Oct 13 '10 at 16:12
  • My confusion comes from the second "equivalent" set of queries. i would argue that they're not equivalent - since they return different answers. But my argument is irrelevant, since the ANSI guys *defined* the first is an alias of the 2nd. – Ian Boyd Oct 16 '10 at 15:19
  • Accepted this answer, as it points out `NOT IN` is defined to be an alias for `(x<>a) AND (x<>b) AND (x<>c) ...`. – Ian Boyd Oct 16 '10 at 15:26
  • 4
    @Andomar - Actually, I believe it goes: (not unknown) and (not true) = unknown and false = false, which is the same result as: not (unknown or true) = not (true) = false. Similarly, not (unknown or false) = not (unknown) = unknown, which is the same result as: not (unknown and true) = not (unknown) = unknown. So the theorem holds for 3-valued logic, provided the axioms: unknown or true = true; unknown and false = false; – Rob Parker Apr 20 '11 at 19:05
10

The assumption in your first sentence isn't right:

Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?

But comparison of a value to null does not return false; it returns unknown. And unknown has its own logic:

unknown  AND  true   = unknown
unknown  OR   true   = true
unknown  OR   false  = unknown

One example of how this works out:

where 1 not in (2, null)
--> where 1 <> 2 and 1 <> null
--> where true and unknown
--> where unknown

The where clause only matches on true, so this filters out any row.

You can find the full glory of 3 value logic at Wikipedia.

Andomar
  • 232,371
  • 49
  • 380
  • 404