The short answer:
There is a NULL within the collection returned by your subquery. You can solve the problem by removing that NULL value before finishing the subquery or to use NOT EXISTS
predicate instead of NOT IN
, as it does it implicitly.
The long answer (From T-SQL Fundamentals, Third edition, by Itzik Ben-Gan)
This is an example: Imagine there is a order with a NULL orderid inside Sales.Orders table, so the subquery returns some integers, and a NULL value.
SELECT custid, companyname
FROM Sales.Customers
WHERE custid NOT IN(SELECT O.custid
FROM Sales.Orders AS O);
The explanation on why the query from above returns an empty set:
Obviously, the culprit here is the NULL
customer ID you added to the Orders table. The NULL
is one of the elements returned by the subquery.
Let’s start with the part that does behave like you expect it to. The IN predicate returns TRUE for a customer who placed orders (for example, customer 85), because such a customer is returned by the subquery. The NOT operator negates the IN
predicate; hence, the NOT TRUE
becomes FALSE
, and the customer is discarded. The expected behavior here is that if a customer ID is known to appear in the Orders table, you know with certainty that you do not want to return it.
However (take a deep breath), if a customer ID from Customers doesn’t appear in the set of non-NULL customer IDs in Orders, and there’s also a NULL
customer ID in Orders, you can’t tell with certainty that the customer is there—and similarly you can’t tell with certainty that it’s not there. Confused? I hope I can clarify this explanation with an example.
The IN
predicate returns UNKNOWN
for a customer such as 22 that does not appear in the set of known customer IDs in Orders. That’s because when you compare it with known customer IDs you get FALSE, and when you compare it with a NULL you get UNKNOWN
. FALSE
OR UNKNOWN
yields UNKNOWN
. Consider the expression 22 NOT IN (1, 2, <other non-22 values>, NULL)
. This expression can be rephrased as NOT 22 IN (1, 2, …, NULL)
. You can expand this expression to NOT (22 = 1 OR 22 = 2 OR … OR 22 = NULL)
. Evaluate each individual expression in the parentheses to its truth value and you get NOT (FALSE OR FALSE OR … OR UNKNOWN)
, which translates to NOT UNKNOWN
, which evaluates to UNKNOWN
.
The logical meaning of UNKNOWN
here, before you apply the NOT
operator, is that it can’t be
determined whether the customer ID appears in the set, because the NULL
could represent that
customer ID. The tricky part here is that negating the UNKNOWN
with the NOT
operator still yields
UNKNOWN
. This means that in a case where it is unknown whether a customer ID appears in a set, it
is also unknown whether it doesn’t appear in the set. Remember that a query filter discards rows that
get UNKNOWN
in the result of the predicate.
In short, when you use the NOT IN
predicate against a subquery that returns at least one NULL
,
the query always returns an empty set. So, what practices can you follow to avoid such trouble? First,
when a column is not supposed to allow NULLs
, be sure to define it as NOT NULL
. Second, in all
queries you write, you should consider NULLs and the three-valued logic. Think explicitly about
whether the query might process NULLs, and if so, whether SQL’s treatment of NULLs is correct for
you. When it isn’t, you need to intervene. For example, our query returns an empty set because of the
comparison with the NULL
. If you want to check whether a customer ID appears only in the set of
known values, you should exclude the NULLs—either explicitly or implicitly. To exclude them explicitly,
add the predicate O.custid
IS NOT NULL to the subquery, like this:
SELECT custid, companyname
FROM Sales.Customers
WHERE custid NOT IN(SELECT O.custid
FROM Sales.Orders AS O
WHERE O.custid IS NOT NULL);
You can also exclude the NULLs implicitly by using the NOT EXISTS
predicate instead of NOT IN
,
like this:
SELECT custid, companyname
FROM Sales.Customers AS C
WHERE NOT EXISTS
(SELECT *
FROM Sales.Orders AS O
WHERE O.custid = C.custid);
Recall that unlike IN
, EXISTS
uses two-valued predicate logic. EXISTS
always returns TRUE
or
FALSE
and never UNKNOWN
. When the subquery stumbles into a NULL
in O.custid
, the expression
evaluates to UNKNOWN
and the row is filtered out. As far as the EXISTS
predicate is concerned, the
NULL
cases are eliminated naturally, as though they weren’t there. So EXISTS
ends up handling only
known customer IDs. Therefore, it’s safer to use NOT EXISTS
than NOT IN
.
The information above is taken from Chapter 4 - Subqueries, T-SQL Fundamentals, Third edition