
I'm running a query over a table variable that holds 22 227 rows. The query used to take 2-3 seconds to complete (which I still think is too slow), but since I added another field to the ORDER BY clause in DENSE_RANK() it now completes in 4.5 minutes!

If I include [t2].[aisdt] with or without [t2].[aiID], the execution plan shows that it's scanning 494 039 529 rows, which is 22 227 squared. The following query generates the correct results, just much too slowly to be useful.

SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID]) AS [SetNum]
          ,[t2].*
    FROM (
        SELECT [aiID]
              ,COUNT(DISTINCT [acID]) AS [noac]
        FROM @Temp
        GROUP BY [aiID]
    ) [t1]
    JOIN @Temp [t2]
      ON [t2].[aiID] = [t1].[aiID]
    WHERE [t1].[noac] < [t2].[asm]
) [t]

Just to be clear, the culprit is `[t2].[aisdt]` in `DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID])`. Removing this field (which needs to remain) drops the execution time back down to 2-3 seconds. I think it might have something to do with JOINing the table to itself on [aiID] but not [aisdt].

How can I speed this query up to complete in the same time as before, or less?


EDIT

Table definition:

DECLARE @Temp TABLE (
    [aiID] INT NOT NULL INDEX [IX_Temp_aiID] -- not unique
    ,[aisdt] DATETIME NOT NULL INDEX [IX_Temp_aisdt] -- not unique
    ,[asm] INT NOT NULL
    ,[cpcID] INT NULL
    ,[cpce] VARCHAR(10) NULL
    ,[acID] INT NULL
    ,[ctvID] INT NULL
    ,[ct] VARCHAR(100) NULL
    ,[_36_other_non_matched_fields_] VARCHAR(MAX)

    ,UNIQUE ([aiID], [cpcID], [cpce], [acID], [ctvID], [ct])
)

[aisdt] is unique per [aiID], but there can be multiple [aiID]s with the same [aisdt].

INSERT INTO @Temp
VALUES (64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 47, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 47, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah')

It must be sorted by [aisdt] (datetime) first, then [aiID], then numbered into sets based on [aiID].

I want to see:

5, 1, 54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 1, 54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah'
– CJ Dennis
  • @MartinSmith 2008 – CJ Dennis Jun 22 '17 at 07:54
  • Could you post create table scripts and some sample data? – etsa Jun 22 '17 at 07:56
  • 1
    Have you added any indexes to the table variable? e.g `DECLARE @Temp TABLE([aisdt] int, [aiID] int, [acID] int, [asm] int, PRIMARY KEY([aiID], [asm]))`? (you may need to add other key columns to the end until you get to a combination that is expected to be unique) – Martin Smith Jun 22 '17 at 07:56
  • @MartinSmith Yes, both fields in the `DENSE_RANK()` have indices, and I tried putting an index on both of them together, but that didn't help by much. – CJ Dennis Jun 22 '17 at 08:49

2 Answers


The main idea is taken from Partition Function COUNT() OVER possible using DISTINCT, which @Jayvee pointed out, with a small addition that makes it work when [acID] has NULL values.

Most likely you can remove all indexes from your @Temp table; the server will have to sort it in several different ways for the different window functions anyway. But there is no self-join, so it should be faster.

The plan will have many sorts, and they can also be slow, especially when the engine underestimates the number of rows in a table. A table variable is exactly this case: the optimiser assumes it holds only 1 row. So I'd recommend using a classic #Temp table here, even without indexes.

An index on (aiID, acID) should help, but there will be other sorts anyway.
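
For illustration, here is a minimal sketch of that switch (assuming the same column list as the question's @Temp; the query below would then read FROM #Temp instead of FROM @Temp):

-- A #Temp table gives the optimiser real row counts instead of the
-- fixed 1-row guess it makes for table variables.
CREATE TABLE #Temp (
    [aiID] INT NOT NULL
    ,[aisdt] DATETIME NOT NULL
    ,[asm] INT NOT NULL
    ,[cpcID] INT NULL
    ,[cpce] VARCHAR(10) NULL
    ,[acID] INT NULL
    ,[ctvID] INT NULL
    ,[ct] VARCHAR(100) NULL
    ,[_36_other_non_matched_fields_] VARCHAR(MAX)
);

INSERT INTO #Temp
SELECT * FROM @Temp; -- or load it from the original source directly

-- Optional: the (aiID, acID) index mentioned above.
CREATE INDEX [IX_Temp_aiID_acID] ON #Temp ([aiID], [acID]);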

WITH
CTE_Counts
AS
(
    SELECT
        *
        -- use DENSE_RANK() to calculate COUNT(DISTINCT)
        , DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID])
        + DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID] DESC)
        -- subtract extra 1 if acID has NULL values within the partition
        - MAX(CASE WHEN [acID] IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [aiID])
        - 1 AS [noac]
    FROM @Temp
)
,CTE_SetNum
AS
(
    SELECT
        *
        , DENSE_RANK() OVER (ORDER BY [aisdt], [aiID]) AS [SetNum]
    FROM CTE_Counts
    WHERE [noac] < [asm]
)
SELECT
    *
    , MAX([SetNum]) OVER () AS [MaxSet]
FROM CTE_SetNum
ORDER BY
    [aisdt]
    ,[aiID]
    ,[SetNum]
;
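
To see why the DENSE_RANK arithmetic above works, here is a tiny standalone illustration with made-up values: within a partition, the ascending and descending dense ranks of any value sum to (number of distinct values + 1), so ASC + DESC - 1 is the same on every row and equals the distinct count, and the MAX(CASE ...) term then removes the NULL that COUNT(DISTINCT) would have ignored.

-- Hypothetical data: acID IN (NULL, 32, 47) in a single partition.
-- ASC and DESC ranks sum to 4 on every row, so ASC + DESC - 1 = 3
-- distinct values including NULL; subtracting 1 for the NULL gives 2,
-- which is what COUNT(DISTINCT acID) would return (it ignores NULLs).
SELECT [acID]
      ,DENSE_RANK() OVER (ORDER BY [acID])
     + DENSE_RANK() OVER (ORDER BY [acID] DESC)
     - MAX(CASE WHEN [acID] IS NULL THEN 1 ELSE 0 END) OVER ()
     - 1 AS [noac] -- 2 on every row
FROM (VALUES (CAST(NULL AS INT)), (32), (47)) AS v([acID]);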
– Vladimir Baranov

An index, as suggested in the comments, would definitely play a major part, but I also think you can rewrite the query without the self-join in this way:

SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t1].[aisdt], [t1].[aiID]) AS [SetNum]
          ,[t1].*
    FROM (
        SELECT *
              ,DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID])
             + DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID] DESC) - 1 AS [noac]
        FROM @Temp
    ) [t1]
    WHERE [t1].[noac] < [t1].[asm]
) [t]
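
One quick way to see where the plain ASC + DESC - 1 trick and COUNT(DISTINCT [acID]) disagree is a sketch like the following: they differ exactly on the [aiID] groups whose [acID] contains NULLs, which is the over-counting discussed in the comments below.

-- Lists the aiID groups where the DENSE_RANK trick and COUNT(DISTINCT)
-- give different counts; with the question's sample data these are the
-- groups whose [acID] values include NULL.
SELECT [t1].[aiID], [t1].[noac] AS [trick], [t2].[noac] AS [count_distinct]
FROM (
    SELECT DISTINCT [aiID]
          ,DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID])
         + DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID] DESC) - 1 AS [noac]
    FROM @Temp
) [t1]
JOIN (
    SELECT [aiID], COUNT(DISTINCT [acID]) AS [noac]
    FROM @Temp
    GROUP BY [aiID]
) [t2] ON [t2].[aiID] = [t1].[aiID]
WHERE [t1].[noac] <> [t2].[noac];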
– Jayvee
  • You can't use count distinct over - not valid – t-clausen.dk Jun 22 '17 at 08:52
  • 1
    Thanks @t-clausen.dk , I have updated to use dense_rank as count distinct based on this: https://stackoverflow.com/questions/11202878/partition-function-count-over-possible-using-distinct – Jayvee Jun 22 '17 at 09:10
  • Your query is nice and fast (a fraction of a second) but it produces far too many results (yours: 16 427 sets compared to mine: 12 338 sets). You must have removed some of my filtering, but I haven't analysed your query yet to see what's logically different. – CJ Dennis Jun 23 '17 at 01:34
  • Thanks for a link to the question that discusses calculating `COUNT(DISTINCT)` using `DENSE_RANK`. Very clever. – Vladimir Baranov Jun 23 '17 at 05:57
  • I just noticed that all 22 227 rows are returned. Some of them should get filtered out. – CJ Dennis Jun 23 '17 at 07:28
  • In theory, running the inner query that returns [t1] should bring back the 22 227 rows with an added column ([noac]); then the wrapper that returns [t] should filter the rows with the WHERE clause [noac] < [asm]. So, as pointed out by @VladimirBaranov, the problem could be the NULL values in [acID]. He's made an enhancement to the dense rank trick to work with NULLs too. – Jayvee Jun 23 '17 at 08:39