18

SQL Fiddle

Background

I have a table of values that some need attention:

| ID      | AddedDate   |
|---------|-------------|
|       1 | 2010-04-01  |
|       2 | 2010-04-01  |
|       3 | 2010-04-02  |
|       4 | 2010-04-02  |
|       5 | NULL        | <----------- needs attention
|       6 | 2010-04-02  |
|       7 | 2010-04-03  |
|       8 | 2010-04-04  |
|       9 | 2010-04-04  |
| 2432659 | 2016-06-15  |
| 2432650 | 2016-06-16  |
| 2432651 | 2016-06-17  |
| 2432672 | 2016-06-18  |
| 2432673 | NULL        | <----------- needs attention
| 2432674 | 2016-06-20  |
| 2432685 | 2016-06-21  |

I want to select the rows where AddedDate is null, and i want to select rows around it. In this example question it would be sufficient to say rows where the ID is ±3. This means i want:

| ID      | AddedDate   |
|---------|-------------|
|       2 | 2010-04-01  | ─╮
|       3 | 2010-04-02  |  │
|       4 | 2010-04-02  |  │
|       5 | NULL        |  ├──ID values ±3
|       6 | 2010-04-02  |  │
|       7 | 2010-04-03  |  │
|       8 | 2010-04-04  | ─╯

| 2432672 | 2016-06-18  | ─╮
| 2432673 | NULL        |  ├──ID values ±3
| 2432674 | 2016-06-20  | ─╯

Note: In reality it's a table of 9M rows, and 15k need attention.

Attempts

First i create a query that builds the ranges i'm interested in returning:

SELECT
  ID-3 AS [Low ID],
  ID+3 AS [High ID]
FROM Items
WHERE AddedDate IS NULL

Low ID   High ID
-------  -------
2        8 
2432670  2432676

So my initial attempt to use this does work:

WITH dt AS (
   SELECT ID-3 AS Low, ID+3 AS High
   FROM Items
   WHERE AddedDate IS NULL
)
SELECT * FROM Items
WHERE EXISTS(
    SELECT 1 FROM dt
    WHERE Items.ID BETWEEN dt.Low AND dt.High)

But when i try it on real data:

  • 9 million total rows
  • 15,000 interesting rows
  • subtree cost of 63,318,400
  • it takes hours (before i give up and cancel it)

enter image description here

There's probably a more efficient way.

Bonus Reading

Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219

4 Answers4

4

Here is one method that uses the windowing clause:

select i.*
from (select i.*,
             count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
             count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
             count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
             count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
      from items
     ) i
where cnt_ad_prec <> cnt_prec or
      cnt_ad_foll <> cnt_foll or
      addeddate is null;
order by id;

This returns all rows that have NULL in the column or are within three rows of a NULL.

The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
4

This is your existing logic rewritten using an moving max:

WITH dt AS (
   SELECT
      ID, AddedDate,
      -- check if there's a NULL within a range of +/- 3 rows
      -- and remember it's ID 
      max(case when AddedDate is null then id end)
      over (order by id 
            rows between 3 preceding and 3 following) as NullID
   FROM Items 
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3
dnoeth
  • 59,503
  • 4
  • 39
  • 56
3

Another way:

SELECT i1.*
    FROM Items i1, Items i2
        WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3

I hope there is index on AddedDate column.

i486
  • 6,491
  • 4
  • 24
  • 41
  • With an estimated subtree cost of 15M, it let it ran for 45 minutes before cancelling. But it's a much more understandable method than the others. – Ian Boyd Feb 24 '18 at 18:08
  • 1
    This would be a good use of Filtered Indexes in SQL server 2008 or newer. – Sql Surfer Feb 24 '18 at 18:25
  • 1
    @IanBoyd The SQL Fiddle shows different plans for this form and `WHERE i2.AddedDate IS NULL AND i1.ID BETWEEN i2.ID - 3 AND i2.ID + 3`. Is it also different for you? – GSerg Feb 24 '18 at 19:51
  • @SqlSurfer I don't think it would be. If you mean a filtered index on `AddedDate`, then a regular index will give you `null` values just as fast, and you cannot have a filtered index for the `ID+-3` situation. – GSerg Feb 24 '18 at 19:53
1

Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:

DECLARE @userData TABLE(
    idInRange int NOT NULL
)

INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r 
FROM Items i 
CROSS JOIN (
  SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
  SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r 
) yourRange
WHERE AddedDate IS NULL;

SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange

Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap

Jorge Y.
  • 1,123
  • 1
  • 9
  • 16