So below it's not obvious how you want group_id to relate to pol_id, so I have ignored it. I am using a CTE just for the fake data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6)
v(sess_id, pol_id, trans_dt, version_id)
)
I would then want to write these operations:
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM data
So r1 and r2 find where there is a gap in sess_id with respect to trans_dt: their difference r3 stays constant while a session's rows are consecutive, and jumps when another session's rows come in between. You then want the changes between r3 and lag_r3 with respect to trans_dt; those are the edges you want to count, thus the SUM, which is zero-based, so a +1 gets the value you want.
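Now the above is not valid in Snowflake, because a window function cannot be applied to the result of another window function in the same SELECT, so it needs to be layered to work: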
SELECT
*
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
FROM data
)
)
ORDER BY trans_dt;
which gives:
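| SESS_ID | POL_ID | TRANS_DT | VERSION_ID | R1 | R2 | R3 | LAG_R3 | SESS_EDGE | GROUP_ID |
|---|---|---|---|---|---|---|---|---|---|
| 101 | 1 | 2021-04-17 09:30:00 | 1 | 1 | 1 | 0 | null | 0 | 1 |
| 101 | 1 | 2021-04-17 09:35:00 | 2 | 2 | 2 | 0 | 0 | 0 | 1 |
| 102 | 1 | 2021-04-17 09:37:00 | 3 | 3 | 1 | 2 | null | 0 | 1 |
| 102 | 1 | 2021-04-17 09:38:00 | 4 | 4 | 2 | 2 | 2 | 0 | 1 |
| 101 | 1 | 2021-04-17 09:39:00 | 5 | 5 | 3 | 2 | 0 | 1 | 2 |
| 101 | 1 | 2021-04-17 09:40:00 | 6 | 6 | 4 | 2 | 2 | 0 | 2 |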
So it can be seen how it is working. This can then be compressed down to:
SELECT
sess_id
,pol_id
,trans_dt
,version_id
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,IFF(LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt)- ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) as r3
FROM data
)
)
ORDER BY trans_dt;
Which is way more complex than Gordon's answer, which, rewritten into the same form as mine, is:
select *
,sum(edge) over ( partition by pol_id, sess_id order by trans_dt ) as grouping
from (
select *
,lag(sess_id) over (partition by pol_id order by trans_dt) as prev_session_id
,iff(prev_session_id = sess_id, 0, 1) AS edge
from data
)
ORDER BY 2,3;
which is rather clever, as it SUMs the edges per sess_id.
But if you add extra data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6),
(102, 1, '2021-04-17 09:41:00', 7),
(102, 1, '2021-04-17 09:42:00', 8),
(103, 1, '2021-04-17 09:43:00', 9),
(103, 1, '2021-04-17 09:44:00', 10)
v(sess_id, pol_id, trans_dt, VERSION_ID)
)
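Gordon's answer will assign the last two rows (session 103) the group 1, because it counts edges per sess_id, while mine keeps a running count across all rows and gives them a higher number, as will Lukasz's; which is right depends on your intended outcome.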
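As a check on the Gordon-style query, here is a self-contained sketch that runs it over the ten rows above. The alias gordon_group is just a name picked for this example, and the values in the trailing comments are worked out by hand from the window definitions, not copied from a Snowflake session.

```sql
WITH data AS (
    SELECT * FROM VALUES
        (101, 1, '2021-04-17 09:30:00', 1),
        (101, 1, '2021-04-17 09:35:00', 2),
        (102, 1, '2021-04-17 09:37:00', 3),
        (102, 1, '2021-04-17 09:38:00', 4),
        (101, 1, '2021-04-17 09:39:00', 5),
        (101, 1, '2021-04-17 09:40:00', 6),
        (102, 1, '2021-04-17 09:41:00', 7),
        (102, 1, '2021-04-17 09:42:00', 8),
        (103, 1, '2021-04-17 09:43:00', 9),
        (103, 1, '2021-04-17 09:44:00', 10)
        v(sess_id, pol_id, trans_dt, version_id)
)
SELECT sess_id
    ,trans_dt
    ,edge
    -- running count of edges, restarted for every (pol_id, sess_id) pair
    ,SUM(edge) OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) AS gordon_group
FROM (
    SELECT *
        ,LAG(sess_id) OVER (PARTITION BY pol_id ORDER BY trans_dt) AS prev_session_id
        -- the first row has no previous session, so the comparison is NULL and IFF returns 1
        ,IFF(prev_session_id = sess_id, 0, 1) AS edge
    FROM data
)
ORDER BY trans_dt;
-- gordon_group in trans_dt order (derived by hand):
-- 101: 1, 1
-- 102: 1, 1
-- 101: 2, 2
-- 102: 2, 2
-- 103: 1, 1
```

Session 103 is only ever seen once, so its per-session count stays at 1 even though it is the fifth run in time order.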
Also, what do you want to happen when pol_id changes: do you want the group to be a global count, or would the second pol_id have the value 1 again?
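If the answer is that each pol_id should start again at 1, one way to sketch it is to add pol_id to every PARTITION BY in the compressed query. The pol_id 2 rows (sessions 201 and 202) are made up for illustration, and the expected values in the comments are derived by hand rather than taken from a real run.

```sql
WITH data AS (
    SELECT * FROM VALUES
        (101, 1, '2021-04-17 09:30:00', 1),
        (101, 1, '2021-04-17 09:35:00', 2),
        (102, 1, '2021-04-17 09:37:00', 3),
        (102, 1, '2021-04-17 09:38:00', 4),
        (101, 1, '2021-04-17 09:39:00', 5),
        (101, 1, '2021-04-17 09:40:00', 6),
        -- hypothetical second policy, not part of the original data
        (201, 2, '2021-04-17 09:31:00', 1),
        (202, 2, '2021-04-17 09:36:00', 2),
        (201, 2, '2021-04-17 09:41:00', 3)
        v(sess_id, pol_id, trans_dt, version_id)
)
SELECT
    sess_id
    ,pol_id
    ,trans_dt
    ,version_id
    -- the running count now restarts for each pol_id
    ,SUM(sess_edge) OVER (PARTITION BY pol_id ORDER BY trans_dt) + 1 AS group_id
FROM (
    SELECT
        *
        ,IFF(LAG(r3) OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) != r3, 1, 0) AS sess_edge
    FROM (
        SELECT *
            -- both row numbers are computed inside the policy, so gaps are per policy
            ,ROW_NUMBER() OVER (PARTITION BY pol_id ORDER BY trans_dt)
             - ROW_NUMBER() OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) AS r3
        FROM data
    )
)
ORDER BY pol_id, trans_dt;
-- group_id (derived by hand):
-- pol_id 1: 1, 1, 1, 1, 2, 2
-- pol_id 2: 1, 1, 2
```

Leaving pol_id out of the PARTITION BY clauses, as in the earlier queries, gives the global count instead.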