2

I'm building a Data Warehouse and I found a problem in two tables data comparison statement. I use EXCEPT operator to compare the tables which have clustered indexes (normal int field as key). My problem is that in query execution plan there are sort operators after both clustered indexes scan. Here's is a code example:

create table temp.table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

insert into temp.table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from meta.GENERATE_SEQUENCE(1,1000);

create clustered index cix_table_a_key_a on temp.table_a (key_a);


create table temp.table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

insert into temp.table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from meta.GENERATE_SEQUENCE(1,1000);

create clustered index cix_table_b_key_b on temp.table_b (key_b);

(GENERATE_SEQUENCE is a row generator)

Now the EXCEPT query:

select 
    key_a,
    some_field_a,
    some_field2_a
from temp.table_a

except

select 
    key_b,
    some_field_b,
    some_field2_b
from temp.table_b

Here's an image of the execution plan:

[Execution plan]

I'm aware that Merge Join needs sorted input, but isn't it already sorted enough? By this I mean that the only sorted columns we need is key_a/key_b. And this is already done because of clustered indexes. Sorts of other columns are not needed because inside every value of key_a/key_b there is only one row - nothing to sort.

So, my question are:

  1. Why there are sort operators after clustered index scans in this situation?
  2. How can I avoid these sorts when I want to use EXCEPT operator?
  3. What are the better ways (if there is any) of doing table comparison?

Thanks in advance for your answers :)

Amira Bedhiafi
  • 8,088
  • 6
  • 24
  • 60
dominjas
  • 31
  • 4
  • Instead of as an image, upload the actual execution plan xml to [Paste The Plan](https://www.brentozar.com/pastetheplan/). To avoid the sort operator, the clustered index should be on all the columns listed in the `SELECT` clause in order to provide the ordering needed for a merge operator. – Dan Guzman Sep 02 '19 at 09:33
  • Here is a Paste the Plan link to actual execution plan: https://www.brentozar.com/pastetheplan/?id=Hk35rw9BB – dominjas Sep 02 '19 at 09:51
  • "Sorts of other columns are not needed because inside every value of key_a/key_b there is only one row - nothing to sort" -- but `EXCEPT` operates on whole rows, not columns, and there is no way for the engine to know that it doesn't "need" to sort the other columns because you promise not to have different data between A and B. – Jeroen Mostert Sep 02 '19 at 09:54
  • Building such an index for all columns does not make any sense, becouse on my real tables I have more than 20 columns. So there is nothing I can do? Is there some reason why optimizer does not know that all is already sorted even if I the only column explicitly sorted is index key? – dominjas Sep 02 '19 at 09:56
  • 1
    The data *isn't* already sorted. It's sorted *only* on the first column. Consider: `(1, a, banana) EXCEPT (1, z, pineapple)`. Is it enough to consider the first column? No. You can have an `EXCEPT` without sorting if you only `EXCEPT` the first column and then look up the other data again. – Jeroen Mostert Sep 02 '19 at 09:57
  • @JeroenMostert thanks, that the answer for my last comment question. – dominjas Sep 02 '19 at 09:59
  • 1
    @dominjas, if the clustered index key is unique, you could try using `NOT EXISTS` instead of `EXCEPT`. Declare the clustered index as unique as well in that case. – Dan Guzman Sep 02 '19 at 10:15
  • @DanGuzman, your comment made that I realized one key thing. I forgot that my key column for clustered index does not have to be unique, because I wrote CREATE CLUSTERED INDEX and not CREATE UNIQUE CLUSTERED INDEX. After that there were no sorts on plan. So the optimizer is wise enough to know this :) – dominjas Sep 02 '19 at 10:20
  • @dominjas, specifying unique is an often overlooked detail, one reason I prefer to declare primary keys and unique constraints instead. I'm curious as to what plan operator is now used to implement except in your plan. Hash join? – Dan Guzman Sep 02 '19 at 10:44
  • @DanGuzman, no, still Merge Join. Previous sorts where there becasue key_a and key_b weren't unique. After changing it to unique there was no need to sort some_field and some_field2 - in the plan Ordered property in both inputs is set to True -> http://sqlfiddle.com/#!18/a3f16/1 – dominjas Sep 02 '19 at 11:44
  • 1
    @dominjas, thanks for sharing. I think you'll get the same plan with NOT EXISTS because SQL Server understands the unique index then makes the queries semantically identical. – Dan Guzman Sep 02 '19 at 12:29

2 Answers2

2

To answer your question isn't it already sorted enough? - No. The comparison is performed on all columns in the SELECT so that all columns need to be included in the sort.

I have included 2 possible solutions, one is to add all columns to the index, the other is to use NOT EXISTS - note this could return duplicate rows if there are duplicates in table_a.

1.) include these columns in the indexes, you will see in the SQLFiddle linked queryplan that no sort is used in this scenario.

SQL Fiddle

MS SQL Server 2017 Schema Setup:

create table table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from Tally;

create clustered index cix_table_a_key_a on table_a (key_a,some_field_a,
    some_field2_a);


create table table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from Tally;

create clustered index cix_table_b_key_b on table_b (key_b, some_field_b, some_field2_b);

Query 1:

select 
    key_a,
    some_field_a,
    some_field2_a
from table_a

except

select 
    key_b,
    some_field_b,
    some_field2_b
from table_b

Results:

2.) Another alternative is to use NOT EXISTS instead of EXCEPT - it should be noted that these are not exactly the same as an EXCEPT effectively performs a DISTINCT too. If you add a DISTINCT to a NOT EXISTS you will get a SORT on the smaller results table.

See SQLFIDLLE here:

SQL Fiddle

MS SQL Server 2017 Schema Setup:

create table table_a
(
    key_a int identity,
    some_field_a int,
    some_field2_a varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_a
(
    some_field_a,
    some_field2_a
)
select
    n,
    'abcd'
from Tally;

create clustered index cix_table_a_key_a on table_a (key_a);


create table table_b
(
    key_b int identity,
    some_field_b int,
    some_field2_b varchar(10)
);

WITH Tally (n) AS
(
    -- 1000 rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
    CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
)
insert into table_b
(
    some_field_b,
    some_field2_b
)
select 
    n,
    'abcd'
from Tally;

create clustered index cix_table_b_key_b on table_b (key_b);

Query 1:

select 
    key_a,
    some_field_a,
    some_field2_a
from table_a
WHERE NOT EXISTS
(
  select NULL 
  from table_b
  WHERE
    key_b = key_a AND
    some_field_b = some_field_a AND
    some_field2_b = some_field2_a
)

Results:

Steve Ford
  • 7,433
  • 19
  • 40
  • So, all except columns must be EXPLICITLY sorted before Merge Join? There is no way for optimizer to know something is sorted without the actual sort operation in this case? – dominjas Sep 02 '19 at 10:15
  • @dominjas see updated answer showing how to use NOT EXISTS to achieve the result with no sorting – Steve Ford Sep 02 '19 at 10:23
  • Thank you for your answer and examples, but there is a way to achieve my goal with EXCEPT and only key_a/key_b columns as index key. I only needed to change the index - add UNIQUE. After that there were no sorts in plan. See comments under my question and also: http://sqlfiddle.com/#!18/a3f16/1 – dominjas Sep 02 '19 at 11:50
0

Thank you all guys for your help. Below is full answer for my question gathered from comments and answers:

  1. Why there are sort operators after clustered index scans in this situation?

Sort operators are there because index key columns (key_a, key_b) are not unique for optimizer.

  1. How can I avoid these sorts when I want to use EXCEPT operator?

Be sure that your index key columns are unique - use UNIQUE CLUSTERED INDEX or set constraint on these fields.

  1. What are the better ways (if there is any) of doing table comparison?

Alternate solutions considering adding more columns to index key or using NOT EXISTS instead of EXCEPT were given in answer of Steve Ford (@steve-ford).

dominjas
  • 31
  • 4