Grouping and counting rows by value until it changes

Question

I have a table where messages are stored as they happen. Usually there is a message 'A', and sometimes the A's are separated by a single message 'B'. Now I want to group the values so that I can analyze them, for example to find the longest 'A'-streak or the distribution of 'A'-streaks.

I already tried a COUNT ... OVER query, but that keeps counting across all rows instead of restarting when the message changes.

SELECT message, COUNT(*) OVER (ORDER BY Timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

This is my example data:

Timestamp        Message
20150329 00:00   A
20150329 00:01   A
20150329 00:02   B
20150329 00:03   A
20150329 00:04   A
20150329 00:05   A
20150329 00:06   B

I want the following output:

Message    COUNT
A          2
B          1
A          3
B          1

So there are two columns involved here, message and timestamp? – jarlh, Mar 29 '15 at 09:32
There is a timestamp column but the data is stored in order anyway. – dwonisch, Mar 29 '15 at 09:34
Always consider data as unordered! (Even if it seems to be ordered right now, that may change in the future.) Never ever write queries that depend on an implicit order! – jarlh, Mar 29 '15 at 09:40
@jarlh he is ordering by Timestamp, so what's wrong with that? – Mihail Shishkov, Mar 29 '15 at 09:46
@mcl, only because of the "the data is stored in order anyway" answer. – jarlh, Mar 29 '15 at 09:48

Mihail Shishkov · Accepted Answer · 2015-03-29T11:01:54.320

11

That was interesting :)

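;WITH cte AS (
    SELECT Message, Timestamp,
           -- gn: position of the row among rows with the same Message
           ROW_NUMBER() OVER (PARTITION BY Message ORDER BY Timestamp) AS gn,
           -- rn: position of the row in the whole table
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM Messages
), cte2 AS (
    -- gn - rn stays constant within a run of consecutive equal messages
    SELECT Message, Timestamp, gn, rn, gn - rn AS gb
    FROM cte
), cte3 AS (
    -- one row per run: its first timestamp and its length
    SELECT Message, MIN(Timestamp) AS Ts, COUNT(1) AS Cnt
    FROM cte2
    GROUP BY Message, gb
)
SELECT Message, Cnt
FROM cte3
ORDER BY Ts;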

Here is the result set:

Message   Cnt
A         2
B         1
A         3
B         1

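The query could be shorter, but I posted it this way so you can see what's happening. The result is exactly as requested. The most important part is gn - rn. The idea is to number the rows within each Message partition and, at the same time, number the rows in the whole set. If you subtract one from the other, the difference stays the same for as long as the same message repeats and changes when a new streak of that message starts, so Message together with gb identifies each group. Here is the intermediate result: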

;WITH cte AS (
    SELECT Message, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Message ORDER BY Timestamp) AS gn,
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM Messages
), cte2 AS (
    SELECT Message, Timestamp, gn, rn, gn - rn AS gb
    FROM cte
)
SELECT * FROM cte2;

Message  Timestamp                gn  rn  gb
A        2015-03-29 00:00:00.000  1   1   0
A        2015-03-29 00:01:00.000  2   2   0
B        2015-03-29 00:02:00.000  1   3   -2
A        2015-03-29 00:03:00.000  3   4   -1
A        2015-03-29 00:04:00.000  4   5   -1
A        2015-03-29 00:05:00.000  5   6   -1
B        2015-03-29 00:06:00.000  2   7   -5
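
If you want to check the gn - rn arithmetic outside the database, here is a minimal Python sketch of the same idea. It assumes the rows are already sorted by Timestamp. The function names row_number_difference and streaks are made up for this illustration and are not part of the query above.

```python
from collections import defaultdict

# Messages from the question, assumed to be sorted by Timestamp already.
messages = ["A", "A", "B", "A", "A", "A", "B"]


def row_number_difference(msgs):
    """Return (message, gn, rn, gb) per row, mirroring the cte2 columns."""
    seen = defaultdict(int)  # rows seen so far per message, gives gn
    rows = []
    for rn, msg in enumerate(msgs, start=1):  # rn: position in the whole set
        seen[msg] += 1
        gn = seen[msg]  # gn: position among rows with the same message
        rows.append((msg, gn, rn, gn - rn))
    return rows


def streaks(msgs):
    """Count rows per (message, gb) group, in order of first appearance."""
    counts = {}  # dicts keep insertion order, standing in for ORDER BY MIN(Timestamp)
    for msg, _gn, _rn, gb in row_number_difference(msgs):
        counts[(msg, gb)] = counts.get((msg, gb), 0) + 1
    return [(msg, cnt) for (msg, _gb), cnt in counts.items()]


print(streaks(messages))  # [('A', 2), ('B', 1), ('A', 3), ('B', 1)]
```

Note that gb alone would not be enough as a grouping key: for the sequence A, B, A the B row and the second A row both get gb = -1, which is why the query groups by Message and gb together.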

edited Mar 29 '15 at 11:01

answered Mar 29 '15 at 10:52


Is rn from the first CTE really available in the second CTE? – Mihai Mar 29 '15 at 11:05
It runs slowly but works perfectly (and I have plenty of time for that query). So yes, it is available. – dwonisch Mar 29 '15 at 11:06
@Mihai sorry, I do not understand your question. – Mihail Shishkov Mar 29 '15 at 11:06
I thought CTEs were evaluated as a whole, so the value from the first wasn't available for processing in the second or subsequent ones. Upvoted. – Mihai Mar 29 '15 at 11:07
@woni Try adding two indexes: one on (Message ASC, Timestamp ASC) and the other on just Timestamp ASC. – Mihail Shishkov Mar 29 '15 at 11:09
@Mihai you can use cte in cte2 and cte2 in cte3. Each CTE can reference the CTEs defined before it in the same WITH clause, but not ones defined after it (cte can't use cte2, for example). – Mihail Shishkov Mar 29 '15 at 11:11

Giorgi Nakeuri · Answer 2 · 2015-03-29T17:02:10.950

Here is a slightly shorter solution:

DECLARE @t TABLE ( d DATE, m CHAR(1) )

INSERT  INTO @t
VALUES  ( '20150301', 'A' ),
        ( '20150302', 'A' ),
        ( '20150303', 'B' ),
        ( '20150304', 'A' ),
        ( '20150305', 'A' ),
        ( '20150306', 'A' ),
        ( '20150307', 'B' );

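WITH
-- n = 1 on rows where the message differs from the previous row, else 0
c1 AS (SELECT d, m, IIF(LAG(m, 1, m) OVER (ORDER BY d) = m, 0, 1) AS n FROM @t),
-- the running sum of the change flags numbers each streak
c2 AS (SELECT m, SUM(n) OVER (ORDER BY d) AS n FROM c1)
SELECT m, COUNT(*) AS c
FROM c2
GROUP BY m, n
ORDER BY n;  -- without ORDER BY the row order is not guaranteed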

Output:

m   c
A   2
B   1
A   3
B   1

The idea is to get the value 1 on rows where the message changes:

2015-03-01  A   0
2015-03-02  A   0
2015-03-03  B   1
2015-03-04  A   1
2015-03-05  A   0
2015-03-06  A   0
2015-03-07  B   1

The second step is a running sum: the current row's value plus all preceding values:

2015-03-01  A   0
2015-03-02  A   0
2015-03-03  B   1
2015-03-04  A   2
2015-03-05  A   2
2015-03-06  A   2
2015-03-07  B   3

This way every streak gets its own running-sum value, so you can group by the message column and the calculated column.
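
The two steps can also be traced in a few lines of Python, as a rough sketch rather than a replacement for the query. It assumes the messages are already sorted by date. The names change_flags and streaks_by_running_sum are invented for this example.

```python
from itertools import accumulate

# Messages from the sample data, assumed to be sorted by date already.
messages = ["A", "A", "B", "A", "A", "A", "B"]


def change_flags(msgs):
    """1 where the message differs from the previous row, else 0."""
    # Mirrors LAG(m, 1, m): the first row is compared with itself, so it gets 0.
    previous = [msgs[0]] + msgs[:-1] if msgs else []
    return [0 if cur == prev else 1 for cur, prev in zip(msgs, previous)]


def streaks_by_running_sum(msgs):
    """Group rows by (message, running sum of change flags) and count them."""
    group_ids = accumulate(change_flags(msgs))  # like SUM(n) OVER (ORDER BY d)
    counts = {}  # insertion order follows the order of the rows
    for msg, gid in zip(msgs, group_ids):
        counts[(msg, gid)] = counts.get((msg, gid), 0) + 1
    return [(msg, cnt) for (msg, _gid), cnt in counts.items()]


result = streaks_by_running_sum(messages)
print(result)  # [('A', 2), ('B', 1), ('A', 3), ('B', 1)]

# Longest 'A'-streak, as asked in the question.
print(max(cnt for msg, cnt in result if msg == "A"))  # 3
```

The intermediate lists match the two tables above: the flags are 0, 0, 1, 1, 0, 0, 1 and their running sum is 0, 0, 1, 2, 2, 2, 3.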
