0

I'm analyzing the concept of anti semi-join in the SQL Server,

The Microsoft documentation that describes the keyword ANTISEMIJOIN of the U-SQL syntax quotes the following query for SQL:

SELECT * FROM A 
WHERE A.Key NOT IN (SELECT B.Key FROM B)

Others suggest the following query:

SELECT * FROM A 
WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.Key = B.Key)

Both of them seem to return the same records, the query proposed by Microsoft seems to have better performance, however I found a case in which the two queries give different results and I don't know which of them is the right one. I show it with an example:

DECLARE @A TABLE ([OnA] [int], [DataA] [text])
DECLARE @B TABLE ([OnB] [int], [DataB] [text])

INSERT INTO @A VALUES (1, 'A1'), (2, 'A2'), (NULL, 'A3')
INSERT INTO @B VALUES (1, 'B1'), (3, 'B2'), (NULL, 'B3')

--LEFT SEMI-JOIN
SELECT * FROM @A 
WHERE OnA IN (SELECT OnB FROM @B)           
--Returns with ANSI_NULLS ON:  1 | A1              (Correct!)       
--Returns with ANSI_NULLS OFF: 1 | A1, NULL | A3   (Correct!)

SELECT * FROM @A 
WHERE EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)  
--Returns with ANSI_NULLS ON:  1 | A1              (Correct!) 
--Returns with ANSI_NULLS OFF: 1 | A1              (Not correct!)("ANSI_NULLS OFF" insensitive!)

--LEFT ANTISEMI-JOIN
SELECT * FROM @A 
WHERE OnA NOT IN (SELECT OnB FROM @B)                
--Returns with ANSI_NULLS ON:  empty               (Not Correct and very dangerous!)                 
--Returns with ANSI_NULLS OFF: 2 | A2              (Correct!)

SELECT * FROM @A 
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB) 
--Returns with ANSI_NULLS ON:  2 | A2, NULL | A3   (Correct!)
--Returns with ANSI_NULLS OFF: 2 | A2, NULL | A3   (Not correct!)("ANSI_NULLS OFF" insensitive!)

It's evident that with the presence of NULL in the correspondence of the key columns, the query left antisemi-join quoted by Microsoft doesn't return (in normal case: when ANSI_NULLS ON) the complementary result to the left semi-join query and this is already a bug. In addition, zero records return always, it's very serious!

On the other hand, the other formulas with the EXIST and NOT EXIST operator also don't seem to be consistent with the "ANSI_NULLS OFF" setting. This error affects also the results of the "left/right semi join" operations!

So, I assume that the "semi join" operation is ANSI_NULLS proof in T-SQL/SQL Server, but it should be done only in this way:

--LEFT SEMI-JOIN
SELECT * FROM @A 
WHERE OnA IN (SELECT OnB FROM @B)

While the "antisemi join" operation is not safe and not ANSI_NULLS proof, so we have the responsibility to use the right variant in certainly static contexts:

--LEFT ANTISEMI-JOIN (when ANSI_NULLS is ON)
SELECT * FROM @A 
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)

--LEFT ANTISEMI-JOIN (when ANSI_NULLS is OFF and it's possible that there is at least one record with NULL in key value)
SELECT * FROM @A 
WHERE OnA NOT IN (SELECT OnB FROM @B)

Do you all agree with me?

These are my questions:

  1. Would it be right to handle the "NullVsNull" eventuality with SET ANSI_NULLS rather than an explicit query?

  2. What would be the best single query to faithfully reflect the ANTI SEMI JOIN behavior regardless of the setting ANSI_NULLS?

  3. Is it possible that Microsoft and Others are both making a mistake?

  4. Why does the EXIST operator seem to be insensitive to "ANSI_NULLS OFF" setting?

  5. Why does the NOT IN operator always return EMPTY in the fifth case?

J. Mini
  • 1,868
  • 1
  • 9
  • 38
jangix
  • 49
  • 7
  • In MS SQL this is generally accepted as _By Design_ you can't and should not try to use Null comparison in this way when `SET ANSI_NULLS ON`. Nulls in this mode effectively excludes the null value records from comparison at all, unless you use IS NULL or coalescing functions. Use `SET ANSI_NULLS OFF` if you want nulls to participate in direct value comparisons as values. https://docs.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver15#:~:text=SQL%20Server%20returns%20an%20error%20that%20lists%20all,no%20such%20indexes%20on%20the%20tables%20or%20views. – Chris Schaller Mar 28 '21 at 22:28
  • With SET ANSI_NULLS OFF the NullVsNull case is considered a match and I don't think it should be, on the contrary if I forget to set ANSI_NULLS OFF I would always get an empty list and that I think is even more dangerous. In my opinion, the query with the NOT EXISTS clause remains the most precise null-proof regardless of the ANSI_NULLS. So why didn't Microsoft mention this query? Perhaps the U-SQL engine mirrors the query with the NOT IN clause? I'd like to try.. – jangix Mar 28 '21 at 23:22
  • I'm not suggesting that ANSI_NULLS OFF is a good solution, but its one that works, if nulls are involved use NOT EXISTS and or IS NOT NULL, or `COALESCE` the values that might be null (or `ISNULL`) to an arbitrary value to make comparisons work, you only asked why there is a difference, this isn't specifically acknowledged in the docs because it is inferred knowledged about JOIN and comparison logic in general for MS SQL Server, many of us build queries specifcally around this type of evaluation behaviour. (I usually `IsNull(expr, -1)` when I need to compare to an expected null value) – Chris Schaller Mar 29 '21 at 00:16
  • It's not a mistake or an error, this is by design. The following article explains: http://www.sqlbadpractices.com/using-not-in-operator-with-null-values/ – Sean Pearce Mar 29 '21 at 09:46
  • Does this answer your question? [NOT IN vs NOT EXISTS](https://stackoverflow.com/questions/173041/not-in-vs-not-exists) – Charlieface Mar 30 '21 at 01:54
  • Thank you @Charlieface it is a great article but I haven't found yet all the answers, I updated the main question. – jangix Mar 30 '21 at 13:43

2 Answers2

1

You already understand the idea why it's happening. I found a good article in below link. Please check it out.

Microsoft might not have made a mistake. According to below article Microsoft SQL Server handles null differently than ANSI standard if you set ansi_nulls to off.

Use below line before your query

set ansi_nulls off

And not in() will return your desired result though there is null in the list.

For further reading:

https://www.sqlservercentral.com/articles/four-rules-for-nulls

In the provided Microsoft's documentation they also mentioned:

Consider using SEMIJOIN (U-SQL) for situations where you would use a subquery with IN in SQL. Consider using ANTISEMIJOIN (U-SQL) for situations where you would use a subquery with NOT IN in SQL

Dale K
  • 25,246
  • 15
  • 42
  • 71
  • Link only answers on considered poor quality on this site and may get deleted. You need to quote the relevant part of the link such that the answer is stand along in the case of the links going down. – Dale K Mar 28 '21 at 22:27
  • @DaleK I have removed screenshot. Since I am just roaming around fir couple of months may I know when to use images in answer? – Kazi Mohammad Ali Nur Romel Mar 28 '21 at 23:00
  • @DaleK Is the answer appropriate now? Or it needs more clarity? Would really appreciate your valuable suggestion. – Kazi Mohammad Ali Nur Romel Mar 28 '21 at 23:02
  • 1
    Microsoft's error may not have mentioned this conditionality of SET ANSI_NULLS OFF. Not a small mistake I would say... – jangix Mar 28 '21 at 23:27
  • They misrepresented null related infprmation in books online also. – Kazi Mohammad Ali Nur Romel Mar 28 '21 at 23:32
  • 1
    @KaziMohammadAliNur about the only time a screenshow is appropriate is if you are building a user interface and need to show how it looks. (Or for example showing the rendering of a SSRS report). Anything else should be appropriately formatted text. – Dale K Mar 28 '21 at 23:36
  • 1
    @jangix - you should forget `SET ANSI_NULLS OFF` exists. Implementing that was a mistake made > 20 years ago. Just concentrate on standard SQL and, yes, the behaviour of `NOT IN` vs `NOT EXISTS` is different in the presence of nulls. As for which one, if any, mirrors the USQL behaviour I don't know – Martin Smith Mar 29 '21 at 14:10
1

First of all, here's a Fiddle to help play with all this:

http://sqlfiddle.com/#!18/aa477/20

Now we can talk about what's going on. The answer here starts with a correct understanding of what NULL means.

People often think of NULL as meaning, "this value is empty", but this understanding is incomplete. It is better to think of NULL as meaning, "I don't know what this value is".

With the former (empty) understanding, comparing two NULL values ({empty} == {empty}) could be reasonably seen to produce true. But with the new correct (I don't know) understanding of NULL, we can see it makes much less sense to assume one {I don't know} will be equal to another {I don't know}. The correct answer to this comparison is more obviously still {I don't know}. Therefore NULL == NULL is still... NULL. Furthermore, NULL is falsy when forced into a boolean expression.


This is what the ANSI standard defines, what you should stick with pretty much all the time, and also the correct way to understand how NULL formally interacts with any of the mathematical/relational algebra join concepts.


If I really wanted to formally express this join type, I'd use the NOT EXISTS() variant. You could also write it like this:

SELECT @A.* 
FROM @A 
LEFT JOIN @B ON OnA = OnB
WHERE  @B.OnB IS NULL

We sometimes call this an Exclusion Join, but NOT EXISTS() also typically executes faster than the LEFT JOIN.

Again, for all of these options, understanding what is correct still hinges on a correct understanding of the NULL comparisons. Since the NULL from the A3 row is not known to be the same value as the NULL from the B3 row, that record should be included.

However, (and this is important), all this is in the theoretical abstract. In the practical concrete world, the correct result for your application depends on the meaning of the value. Sometimes a particular system, application, developer, or data model will decide NULL really does just mean "empty" for this purpose. Whether this is wise, or formally correct, is irrelevant. What is relevant is understanding the meaning of the data you're looking at and producing query results that answer the questions asked by your requirements.


There is still the question of whether this query shows a mistake or bug with Sql Server:

SELECT * FROM @A 
WHERE EXISTS (SELECT 1 FROM @B WHERE OnA = OnB) 

With ANSI_NULLS on (the default), things are correct. We don't know the NULL values are the same, and so the record should not be included. However, we the same results when ANSI_NULLS are off, even though we'd expect the NULL = NULL comparison to include that row. You may be correct about a bug here, though people smarter than me may explain why it isn't. But if so, I'm not really bothered by it. In my opinion, if you turn off ANSI_NULLS you're already playing with fire and deserve to run into this kind of thing.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794