Gaps and islands, whole division edition

Question

so I'm trying to implement gaps and islands problem over a set of numbers, here's an example:

0 0 3 4 5 6 6 7 7 8 11 12 18 22

That's a cumulative sum, so the numbers can not decrease. What I need to do is to separate respective records into groups by these rules:

The first (and the smallest) number in a group is a "leading" number
A number can be in the same group as its' leading number only if it's larger by no more than a certain number (let's say 7 for this example)
The first number to exceed leading number + 7 is a leading number for the next group

So with the example shown above, the groups would be: 0 0 3 4 5 6 6 7 7 8 11 12 18 22

It's almost like dividing a number by the gap, and getting a group number that way, but since a gap between the last number in a group and the leading number in the next group can be any positive number, this would get more and more incorrect as the sequence grows. I tried case when sum(...) over(...) > 7 then sum(...) else 0 end but it affects every single number that's not in the first group, so I'm not sure how to approach this anymore. Any help would be appreciated! In case that's important, the table is in Snowflake

Lukasz Szozda · Answer 1 · 2021-11-24T17:19:40.073

4

Snowflake supports MATCH_RECOGNIZE which is designed to find patterns in series of rows:

SELECT col2,  bin_num
FROM T
MATCH_RECOGNIZE (
  ORDER BY col2
  MEASURES  MATCH_NUMBER() AS bin_num
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN ( A+ )
  DEFINE A AS FIRST(col2) + 7 >=  A.col2
)
ORDER BY COL2;

db<>fiddle demo

Output:

An alternative approach to be used is recursive cte.

WITH RECURSIVE cte AS (
   SELECT *, ROW_NUMBER() OVER(ORDER BY col2) AS rn
   FROM  t
), rec AS (
  SELECT col2, rn, col2 AS first_val, 1 AS grp
  FROM cte
  WHERE rn = 1
  UNION ALL
  SELECT c.col2, c.rn, IFF(r.first_val + 7 >= c.col2, r.first_val, c.col2), 
         grp + IFF(r.first_val + 7 >= c.col2, 0, 1)
  FROM rec r
  JOIN cte c
    ON r.rn = c.rn-1
)
SELECT col2, grp
FROM rec
ORDER BY col2;

db<>fiddle demo

Related: https://stackoverflow.com/a/53994970/5070879 and https://stackoverflow.com/a/52936314/5070879

edited Nov 24 '21 at 17:19

answered Nov 24 '21 at 14:53

Lukasz Szozda

162,964
23
234
275

So I tried the MATCH_RECOGNIZE option, and what I'm seeing right now is everything but the first group is being filtered out, by swapping ">=" in DEFINE A AS FIRST_VALUE(col2) + 7 >= A.col2 to "<=" I'm getting just the first group filtered, and all the values are equal to 1. Also I changed FIRST with FIRST_VALUE, since I get an error when trying to use FIRST, so the results are completely different. What do you think I might be missing? – H8oddo Nov 30 '21 at 11:43
After trying to dig through this myself, it seems that according to MATCH_RECOGNIZE() Snowflake documentation, DEFINE A AS FIRST(col2) + 7 >= A.col2 is invalid since FIRST(...) can't be used in a DEFINE subclause. Is there a way to bypass it? – H8oddo Nov 30 '21 at 12:00
Hi @H8oddo. The FIRST_VALUE cannot be used as it changes the meaning of the pattern(first value per partition and there is single partition in this case). Yes, it seems Snowflake as for now does not support FIRST clause: https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html#limitations-on-window-functions-used-in-define-and-measures. It may be possible to rewrite it with `LAST()` and changing the regular expression. As for now, you could still use recursive version. – Lukasz Szozda Nov 30 '21 at 14:56

Gaps and islands, whole division edition

1 Answers1