I'm analyzing the concept of anti semi-join in the SQL Server,
The Microsoft documentation that describes the keyword ANTISEMIJOIN of the U-SQL syntax quotes the following query for SQL:
SELECT * FROM A
WHERE A.Key NOT IN (SELECT B.Key FROM B)
Others suggest the following query:
SELECT * FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.Key = B.Key)
Both of them seem to return the same records, the query proposed by Microsoft seems to have better performance, however I found a case in which the two queries give different results and I don't know which of them is the right one. I show it with an example:
DECLARE @A TABLE ([OnA] [int], [DataA] [text])
DECLARE @B TABLE ([OnB] [int], [DataB] [text])
INSERT INTO @A VALUES (1, 'A1'), (2, 'A2'), (NULL, 'A3')
INSERT INTO @B VALUES (1, 'B1'), (3, 'B2'), (NULL, 'B3')
--LEFT SEMI-JOIN
SELECT * FROM @A
WHERE OnA IN (SELECT OnB FROM @B)
--Returns with ANSI_NULLS ON: 1 | A1 (Correct!)
--Returns with ANSI_NULLS OFF: 1 | A1, NULL | A3 (Correct!)
SELECT * FROM @A
WHERE EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)
--Returns with ANSI_NULLS ON: 1 | A1 (Correct!)
--Returns with ANSI_NULLS OFF: 1 | A1 (Not correct!)("ANSI_NULLS OFF" insensitive!)
--LEFT ANTISEMI-JOIN
SELECT * FROM @A
WHERE OnA NOT IN (SELECT OnB FROM @B)
--Returns with ANSI_NULLS ON: empty (Not Correct and very dangerous!)
--Returns with ANSI_NULLS OFF: 2 | A2 (Correct!)
SELECT * FROM @A
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)
--Returns with ANSI_NULLS ON: 2 | A2, NULL | A3 (Correct!)
--Returns with ANSI_NULLS OFF: 2 | A2, NULL | A3 (Not correct!)("ANSI_NULLS OFF" insensitive!)
It's evident that with the presence of NULL in the correspondence of the key columns, the query left antisemi-join quoted by Microsoft doesn't return (in normal case: when ANSI_NULLS ON) the complementary result to the left semi-join query and this is already a bug. In addition, zero records return always, it's very serious!
On the other hand, the other formulas with the EXIST and NOT EXIST operator also don't seem to be consistent with the "ANSI_NULLS OFF" setting. This error affects also the results of the "left/right semi join" operations!
So, I assume that the "semi join" operation is ANSI_NULLS proof in T-SQL/SQL Server, but it should be done only in this way:
--LEFT SEMI-JOIN
SELECT * FROM @A
WHERE OnA IN (SELECT OnB FROM @B)
While the "antisemi join" operation is not safe and not ANSI_NULLS proof, so we have the responsibility to use the right variant in certainly static contexts:
--LEFT ANTISEMI-JOIN (when ANSI_NULLS is ON)
SELECT * FROM @A
WHERE NOT EXISTS (SELECT 1 FROM @B WHERE OnA = OnB)
--LEFT ANTISEMI-JOIN (when ANSI_NULLS is OFF and it's possible that there is at least one record with NULL in key value)
SELECT * FROM @A
WHERE OnA NOT IN (SELECT OnB FROM @B)
Do you all agree with me?
These are my questions:
Would it be right to handle the "NullVsNull" eventuality with SET ANSI_NULLS rather than an explicit query?
What would be the best single query to faithfully reflect the ANTI SEMI JOIN behavior regardless of the setting ANSI_NULLS?
Is it possible that Microsoft and Others are both making a mistake?
Why does the EXIST operator seem to be insensitive to "ANSI_NULLS OFF" setting?
Why does the NOT IN operator always return EMPTY in the fifth case?