2

The Problem

I need to find and remove all of the equal rows from two tables who's schemas may often change. In this case, equal meaning all values in the rows are equal to each other, for example:

ID Message Date = ID Message Date
1 'Hello' 1/1/1999 == 1 'Hello' 1/1/1999
2 'Hello' 1/1/1999 != 2 'Goodbye' 1/1/1999
3 'Goodbye' null == 3 'Goodbye' null

I would like to do so utilizing the EXCEPT keyword, because it does not require me to specify the equality condition for each column and update the query when it has non-key column names changes. I am aware that using wildcards in selects is a bad code smell, however both of these tables will always have the same non-hidden columns, and writing the procedure this way will make the code much easier to maintain.

A possibly important note, one of the tables Table1 is a Temporal table, and Table2 is not. However, the history for Table1 is stored in a separate table, and the queries are not utilizing SYSTEM_TIME (ie, querying current data only).

What I've tried

Table1 is created with the following script:

CREATE TABLE Table1 (
  ID INT NOT NULL PRIMARY KEY
  , ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START HIDDEN NOT NULL
  , ValidTo DATETIME2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL
  , PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON)

Table2 is created with the following script:

CREATE TABLE Table2 (
  ID INT NOT NULL PRIMARY KEY
)

What I have been trying is the following, which I think should be deleting all entries from Table1 for which an exact match across all columns exists in Table2:

DELETE FROM [Table1]
  WHERE [Table1].[ID] NOT IN (
    SELECT ID FROM (
      SELECT * FROM [Table1]
      EXCEPT
      SELECT * FROM [Table2]
    ) AS INNER_QUERY
  )

However, the SQL Server gives the following unexpected error:

All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists.

Yet, the following query works as expected with no issues

SELECT * FROM [Table1]
  WHERE [Table1].[ID] NOT IN (
    SELECT ID FROM (
      SELECT * FROM [Table1]
      EXCEPT
      SELECT * FROM [Table2]
    ) AS INNER_QUERY
  )
Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • We need a [mre]. Please provide some minimal table definitions that allow reproduction of this issue. – MatBailie Mar 20 '23 at 20:35
  • That is what I have given, this is the simplified example. Two tables with simply ID columns, one temporal and the other not give this behavior for me – Violet Rosenzweig Mar 20 '23 at 20:37
  • maybe related https://stackoverflow.com/q/75358643/73226 – Martin Smith Mar 20 '23 at 20:39
  • @MartinSmith that may be, but I am desiring the behavior of not querying the Hidden history columns, as `Table2` doesn't have these columns – Violet Rosenzweig Mar 20 '23 at 20:40
  • yeah - I'm just saying related in that treatment of these hidden columns is clearly a bit flaky – Martin Smith Mar 20 '23 at 20:41
  • Sure. I'm looking into that now, as that was my suspicion as well when this issue happened. The confusion for me is from the fact that the simple `EXCEPT` query works in isolation, it only causes issues when it is appearing as a subquery. – Violet Rosenzweig Mar 20 '23 at 20:42
  • 1
    Looks like a bug. Is using BINARY_CHECKSUM(*) solves the duplicate checking for you? It doesn't seem to "take" the hidden columns – siggemannen Mar 20 '23 at 21:03
  • @siggermannen I'll try `HASHBYTES`, as these tables are very large and I cannot have an accidental collision. – Violet Rosenzweig Mar 20 '23 at 21:16
  • 1
    You could maybe just break it up and do `SELECT ID INTO #Ids FROM (SELECT * FROM [Table1] EXCEPT SELECT * FROM TEST_NONTEMPORAL) AS INNER_QUERY` and then reference that – Martin Smith Mar 20 '23 at 21:22
  • @MartinSmith that's a good workaround, I'll go with that for now. – Violet Rosenzweig Mar 20 '23 at 21:26
  • Why not just specify all the columns properly [like you are supposed to](https://stackoverflow.com/questions/3639861/why-is-select-considered-harmful) – Charlieface Mar 20 '23 at 22:01
  • It does seem to be abug, repro here https://dbfiddle.uk/hQzfABEx Side note that you don't need to join back `INNER_QUERY` you can delete from it directly (this won't solve the bug though) – Charlieface Mar 20 '23 at 22:09
  • Why not just specify all the columns properly [like you are supposed to](https://stackoverflow.com/questions/3639861/why-is-select-considered-harmful)? Remember that if someone makes a change to the table definition then you also need to do `sp_refreshsqlmodule` otherwise you still won't get the right results. – Charlieface Mar 20 '23 at 22:10
  • @Charlieface - there is no need for `sp_refreshsqlmodule` here. If a table definition is altered then any execution plans referencing it are automatically recompiled. There is no cached metadata (like with columns for a view) – Martin Smith Mar 20 '23 at 22:13
  • 1
    @Charlieface I mentioned this in the question as to why, but if you need more detail, this is for a small local database that interfaces between a few products, internal and external teams, but is not used for any live application directly. It's just a medium through which data is being transferred in an agreed upon format. And that format will likely need to be changed, added to, etc in the future. This script ALWAYS will be run with two tables who have the exact same column structure. In fact, it is DISIRED for it to cause errors if Table1 and Table2 do not have the exact same layout – Violet Rosenzweig Mar 21 '23 at 00:05
  • Using the wildcard means that first off, this function can be made to take table names as dynamic SQL parameters, and second off, it never has to be updated as columns change so long as the key column's name remains the same. – Violet Rosenzweig Mar 21 '23 at 00:07
  • Further @Charlieface you can look at the second answer of the question you linked, and it points out exactly this scenario, "when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference." Remember, code smells are just something that may signal something wrong, but it is not always bad. – Violet Rosenzweig Mar 21 '23 at 00:09
  • ..you could append a temporal clause to table1..eg `NOT IN ( SELECT ID FROM ( SELECT * FROM [Table1] FOR SYSTEM_TIME AS OF '99991231 23:59:59.9999998' EXCEPT SELECT * FROM [Table2])...` (<—the last digit is 8) or without the time `FOR SYSTEM_TIME AS OF '99991231'` – lptr Mar 21 '23 at 12:22
  • @Iptr that's an interesting solution. I don't have the time to test it, but if you can verify with the given tables that it works you should add an answer – Violet Rosenzweig Mar 21 '23 at 12:38

2 Answers2

1

This is clearly a bug.

The "hidden columns" abstraction is sometimes a bit leaky (other example).

One workaround would be to just break it up into two steps

SELECT ID
INTO   #Ids
FROM   (SELECT *
        FROM   [Table1]
        EXCEPT
        SELECT *
        FROM   [Table2]) AS INNER_QUERY

DELETE FROM [Table1]
WHERE  [Table1].[ID] NOT IN (SELECT ID
                             FROM   #Ids) 

I'm unclear about the circumstances where the Bug manifests itself.

When the exception is thrown for the DELETE it is during algebrization and the call stack looks as follows

enter image description here

When doing the SELECT ... INTO and putting a break point on CRelOp_IntersectExcept::PexprBindSelf the call stack looks as follows (with no error thrown). Perhaps CRelOp_SelectQuery::BindTree has some extra logic for these hidden columns (?!).

enter image description here

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • I cannot seem to find a way to report this bug to Microsoft so that they can fix it. If it's already been reported, that's fine, but I'd like to bring it to their attention if not. – Violet Rosenzweig Mar 20 '23 at 21:33
  • 1
    You can report it as general feedback here https://feedback.azure.com/d365community/forum/04fe6ee0-3b25-ec11-b6e6-000d3a4f0da0 - but otherwise you need some sort of support agreement I believe https://learn.microsoft.com/en-us/sql/sql-server/sql-server-get-help?view=sql-server-ver15 – Martin Smith Mar 20 '23 at 21:37
  • @Charlieface - the direct delete statement would spool the rows anyway for Halloween protection so much the same – Martin Smith Mar 20 '23 at 22:21
  • Good point, but I believe a spool is less overhead than a full temp table, might be wrong (again!) – Charlieface Mar 20 '23 at 22:25
  • I'm sure I have seen cases (blogs) where a temp table to provide manual phase separation performed better but I can't remember why that would be the case (if I didn't dream it in the first place!) – Martin Smith Mar 20 '23 at 22:30
  • https://dba.stackexchange.com/a/230729/3690 – Martin Smith Mar 20 '23 at 22:32
  • 1
    @MartinSmith, how are you able to do this debugging stuff against sqllang.dll? – siggemannen Mar 21 '23 at 09:35
  • 1
    @siggemannen - Set up Visual Studio to get symbols from the Microsoft Symbol Server, attach to SQL Server process with "native code" selected, tell it to break on all exceptions and then run the SQL code that throws the error and look at the call stack - in this case it wasn't massively informative but sometimes it is – Martin Smith Mar 21 '23 at 09:36
  • 2
    Oh ok, i thought one needed access to some microsoft magic :) Thanks a lot, it's nice to be able to see what the server does in the backend – siggemannen Mar 21 '23 at 09:43
1

As mentioned this appears to be a bug.

I understand you don't want to have to specify each column. But you can instead use dynamic SQL, which avoids the overhead of writing the IDs to a temp table each time.

DECLARE @cols nvarchar(max) = (
    SELECT STRING_AGG('t.' + QUOTENAME(c1.name), ', ')
    FROM sys.tables t1
    JOIN sys.columns c1 ON t1.object_id = c1.object_id
    JOIN sys.tables t2
      ON t2.name = 'Table2'
    JOIN sys.columns c2 ON t2.object_id = c2.object_id
      AND c2.name = c1.name
    WHERE t1.name = 'Table1'
      AND c1.is_hidden = 0
      AND c2.is_hidden = 0
);

DECLARE @sql nvarchar(max) = '
DELETE t
FROM Table1 t
WHERE NOT EXISTS (
    SELECT ' + @cols + '
    EXCEPT
    SELECT ' + @cols + '
    FROM Table2 t
);
';

PRINT @sql;  -- your friend

EXEC sp_executesql @sql;

db<>fiddle

Note that I've not used a FROM in the first half of the subquery, this avoids a second scan of the table.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • This is a valid solution as well, though I am fine with a temporary table and prefer it to dynamic sql – Violet Rosenzweig Mar 21 '23 at 00:00
  • I feel the need to expand on this reply, I prefer the temporary table from Martin because it is concise and clear, and it will already be appearing in a dynamic SQL statement (for the table names) so I want it to be very obvious to future devs what it does if it ever breaks. But thank you for the answer as well, it is equally valid and has it's own benefits, too – Violet Rosenzweig Mar 21 '23 at 00:14