Skip null values when counting with RANK() OVER

Question

Given a set of rows, with a field sometimes null and sometimes not:

SELECT 
   Date, TheThing
FROM MyData
ORDER BY Date


Date                     TheThing
-----------------------  --------
2016-03-09 08:17:29.867  a
2016-03-09 08:18:33.327  a
2016-03-09 14:32:01.240  NULL
2016-10-21 19:53:49.983  NULL
2016-11-12 03:25:21.753  b
2016-11-24 07:43:24.483  NULL
2016-11-28 16:06:23.090  b
2016-11-28 16:09:07.200  c
2016-12-10 11:21:55.807  c

I want to have a ranking column that counts the non-null values:

Date                     TheThing  DesiredTotal
-----------------------  --------  ------------
2016-03-09 08:17:29.867  a         1
2016-03-09 08:18:33.327  a         2
2016-03-09 14:32:01.240  NULL      2 <---notice it's still 2 (good)
2016-10-21 19:53:49.983  NULL      2 <---notice it's still 2 (good)
2016-11-12 03:25:21.753  b         3
2016-11-24 07:43:24.483  NULL      3 <---notice it's still 3 (good)
2016-11-28 16:06:23.090  b         4
2016-11-28 16:09:07.200  c         5
2016-12-10 11:21:55.807  c         6

I try the obvious:

SELECT 
   Date, TheThing, 
   RANK() OVER(ORDER BY Date) AS Total
FROM MyData
ORDER BY Date

But RANK() counts nulls:

Date                     TheThing  Total
-----------------------  --------  -----
2016-03-09 08:17:29.867  a         1
2016-03-09 08:18:33.327  a         2
2016-03-09 14:32:01.240  NULL      3 <--- notice it is 3 (bad)
2016-10-21 19:53:49.983  NULL      4 <--- notice it is 4 (bad)
2016-11-12 03:25:21.753  b         5 <--- and all the rest are wrong (bad)
2016-11-24 07:43:24.483  NULL      7
2016-11-28 16:06:23.090  b         8
2016-11-28 16:09:07.200  c         9
2016-12-10 11:21:55.807  c         10

How can i instruct RANK() (or DENSE_RANK()) to not count nulls?

Have you tried using a PARTITION?

Why yes! Much worse:

SELECT 
   Date, TheThing, 
   RANK() OVER(PARTITION BY(CASE WHEN TheThing IS NOT NULL THEN 1 ELSE 0 END) ORDER BY Date) AS Total
FROM MyData
ORDER BY Date

But RANK() counts nulls:

Date                     TheThing  Total
-----------------------  --------  -----
2016-03-09 08:17:29.867  a         1
2016-03-09 08:18:33.327  a         2
2016-03-09 14:32:01.240  NULL      1 <--- reset to 1?
2016-10-21 19:53:49.983  NULL      2 <--- why go up?
2016-11-12 03:25:21.753  b         3 
2016-11-24 07:43:24.483  NULL      3 <--- didn't reset?
2016-11-28 16:06:23.090  b         4 
2016-11-28 16:09:07.200  c         5
2016-12-10 11:21:55.807  c         6

And now i randomly type things - frantic flailing.

SELECT 
   Date, TheThing, 
   RANK() OVER(PARTITION BY(CASE WHEN TheThing IS NOT NULL THEN 1 ELSE NULL END) ORDER BY Date) AS Total
FROM MyData
ORDER BY Date

SELECT 
   Date, TheThing, 
   DENSE_RANK() OVER(PARTITION BY(CASE WHEN TheThing IS NOT NULL THEN 1 ELSE NULL END) ORDER BY Date) AS Total
FROM MyData
ORDER BY Date

Edit: With all the answers, it took many iterations to find all the edge cases that i don't want. In the end what i conceptually wanted was OVER() in order to count. I had no idea OVER applied to anything other than RANK (and DENSE_RANK).

http://sqlfiddle.com/#!18/c6d87/1

Bonus Reading

I don't think if you can do this in a single query. First you need to filter the records for not Nulls and then select all with the ranks that has been created previously. — Majid Akbari, Aug 13 '18 at 17:37
Is `TheThing` ever anything other than `Frob` or `null`? Does that affect the running total? Please show all edge cases. — Aaron Bertrand, Aug 13 '18 at 17:49
Does `Rank()` with `case when TheThing is NULL then Lag ... else TheThing end` to use the prior row's value get you anywhere? It may need a fudge factor to handle an initial `null` value should that occur. — HABO, Aug 14 '18 at 01:57
@HABO or two NULLs in a row, since LAG() would have to be aware of not a constant but rather how many rows back is the last non-NULL. — Aaron Bertrand, Aug 14 '18 at 02:54

score 5 · Accepted Answer · answered Aug 13 '18 at 18:36

5

I think you are looking for a cumulative count:

SELECT Date, TheThing, 
       COUNT(theThing) OVER (ORDER BY Date) AS Total
FROM MyData
ORDER BY Date;

answered Aug 13 '18 at 18:36

Gordon Linoff

1,242,037
58
646
786

1

It won't work the same as `RANK` nor as `DENSE_RANK` when you introduce ties. And is almost the same as first query in my answer :) and that was pointed to me as well :) – Michał Turczyn Aug 13 '18 at 18:53
1

While all the other answer are good, and will work for other people trying to solve another problem, the actual problem i wanted solving indeed was *"counting"*. I thought `RANK OVER` ***was*** the way to count. Turns out `COUNT OVER` is something new. – Ian Boyd Aug 14 '18 at 13:56

Michał Turczyn · Answer 2 · 2018-08-13T18:48:03.200

4

Try this:

declare @tbl table (dt datetime, col int);
insert into @tbl values
('2016-03-09 08:17:29.867', 1),
('2016-03-09 08:18:33.327', 1),
('2016-03-09 14:32:01.240', NULL),
('2016-10-21 19:53:49.983', NULL),
('2016-11-12 03:25:21.753', 1),
('2016-11-24 07:43:24.483', NULL),
('2016-11-28 16:06:23.090', 1),
('2016-11-28 16:09:07.200', 1),
('2016-12-10 11:21:55.807', 1);

select dt,
       col,
       sum(case when col is null then 0 else 1 end) over (order by dt) rnk
from @tbl

The idea is really simple: if you assign 1 to non-null values and zero where the column is null, cumulative sum ordered by date is works exactly as rank excluding nulls.

Other way would be to use RANK combined with ROW_NUMBER, which will respect ties in Date column and works exactly as RANK respecting NULLs:

select dt,
       col,
       case when col is not null then 
           rank() over (order by dt)
       else 
           rank() over (order by dt) - row_number() over (partition by rnDiff order by dt)
       end rnk
from (
    select dt,
           col,
           row_number() over (order by dt) -
               row_number() over (partition by coalesce(col, 0) order by dt) rnDiff
    from @tbl
) a
order by dt

edited Aug 13 '18 at 18:48

answered Aug 13 '18 at 17:56

Michał Turczyn

32,028
14
47
69

2

this doesnt handle duplicates as RANK would – Martin Smith Aug 13 '18 at 17:58
though possibly they were on the wrong track with RANK anyway. But the difference will be seen for example if multiple values exist tied with the lowest dt. RANK would assign all 1. This wont. – Martin Smith Aug 13 '18 at 18:07
Actually, it will assign rank correctly, but instead of 1, there would be 2 (in casse of two tied rows), but it would be the same for tied rows. – Michał Turczyn Aug 13 '18 at 18:11
@MartinSmith Second query handles ties correctly, but I think (from presented data) that first query would be safe to use. – Michał Turczyn Aug 13 '18 at 18:49

score 1 · Answer 3 · answered Aug 13 '18 at 17:47

1

My lizard brain brings me here... sum() over vs rank()

Select *
       ,NewCol = sum(sign(TheThing)) over (Order by Date)
       ,OrEven = sum(TheThing/TheThing) over (Order by Date)  
 From  MyData

Returns

answered Aug 13 '18 at 17:47

John Cappelletti

79,615
7
44
66

score 1 · Answer 4 · answered Aug 13 '18 at 19:59

What about subtracting the current count of NULLs from the rank()?

SELECT date,
       thething,
       rank() OVER (ORDER BY date)
       -
       sum(CASE
             WHEN thething IS NULL THEN
               1
             ELSE
               0
           END) OVER (ORDER BY date) desiredtotal
       FROM mydata;

db<>fiddle

That should also keep the duplicates and gaps rank() produces and doesn't need a subquery.

Yogesh Sharma · Answer 5 · 2018-08-13T18:02:16.157

0

I would use subquery :

SELECT [Date], TheThing,
       (SELECT COUNT(*)
        FROM MyData m
        WHERE m.[Date] <= m1.[Date] AND m.TheThing IS NOT NULL
       ) AS DesiredTotal
FROM MyData m1;

In, similar way you can also try with apply :

SELECT *
FROM MyData m1 CROSS APPLY
    (SELECT COUNT(*) AS DesiredTotal
     FROM MyData m
     WHERE m.[Date] <= m1.[Date] AND m.TheThing IS NOT NULL
    ) m2;

edited Aug 13 '18 at 18:02

answered Aug 13 '18 at 17:54

Yogesh Sharma

49,870
5
26
52

This will work but it's very sensitive to size of data... be wary of exponential runtimes on larger data sets. – Aaron Bertrand Aug 13 '18 at 17:55
@AaronBertrand. . . I know this is not an efficient way to generate the series. – Yogesh Sharma Aug 13 '18 at 18:03

score 0 · Answer 6 · answered Aug 13 '18 at 18:01

I used a CTE to get the right date first and then applied the rank to the modified date:

CREATE TABLE #tmp(dt datetime, TheThing int)

INSERT INTO #tmp VALUES('2016-03-09 08:17:29.867',  1)
INSERT INTO #tmp VALUES('2016-03-09 08:18:33.327',  1)
INSERT INTO #tmp VALUES('2016-03-09 14:32:01.240',  NULL)
INSERT INTO #tmp VALUES('2016-10-21 19:53:49.983',  NULL)
INSERT INTO #tmp VALUES('2016-11-12 03:25:21.753',  1)
INSERT INTO #tmp VALUES('2016-11-24 07:43:24.483',  NULL)
INSERT INTO #tmp VALUES('2016-11-28 16:06:23.090',  1)
INSERT INTO #tmp VALUES('2016-11-28 16:09:07.200',  1)
INSERT INTO #tmp VALUES('2016-12-10 11:21:55.807',  1)


;WITH CTE as (
SELECT 
CASE WHEN TheThing IS NULL THEN (SELECT MAX(dt) from #tmp OrigTbl where OrigTbl.dt <     SubTbl.dt and OrigTbl.TheThing IS NOT NULL) ELSE dt end dtMod,
SubTbl.dt,SubTbl.TheThing
   from #tmp SubTbl)
SELECT dt, TheThing, DENSE_RANK() over(ORDER BY dtMod) from CTE

Skip null values when counting with RANK() OVER

Have you tried using a PARTITION?

Bonus Reading

6 Answers6