3

I'm trying to use Snowflake's match_recognize tool to match a particular pattern across rows. The pattern consists of any sequence of a's and b's, provided that long runs of b's are excluded. In my test case, I want to allow runs of up to 4 b's to be included in the match.

Using the handy https://regexr.com/?2tp0k website, I was able to build the desired regexp:

((ab{0,4})+a)|a+

Applying it to this string:

baabbbaaaaaaaababbabbabbabbbabbbab

I get this one match (in bold), which I am happy with:

baabbbaaaaaaaababbabbabbabbbabbbab

As desired, this is absorbing into the match any run of b's that is 4 or shorter. (It doesn't pick up b at the beginning of the string or the b at the end, but that is expected.) Also note that while it doesn't contain any long runs of b's, there are a bunch of b's spread throughout that match.

For some reason, when I use this regular expression with Snowflake's match_recognize pattern, it doesn't take up all of the short runs of b's.

Hence, instead of matching the entire sequence matched above, it matches these:

baabbbaaaaaaaababbabbabbabbbabbbab

Any suggestions?

Here's the query that illustrates the result:

WITH data AS (
    SELECT * FROM VALUES 
     ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
     (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
     (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
     (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data 
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a b{0,4})+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;

Resulting in this result. Rows 25-27 are not included in the match, and a new match is started at row 28.

Image of results

Eric Cler
  • 33
  • 3

2 Answers2

1

Interestingly enough when changing pattern from ((ab{0,4})+a)|a+ to ( ((a | ab | abb | abbb | abbbb)+ a) | a+ ) it produces:

WITH data AS (
    SELECT * FROM VALUES 
     ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
     (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
     (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
     (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data 
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a | ab | abb | abbb | abbbb)+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;

Output:

enter image description here

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • That is interesting, though in my actual case, I'm wanting to include runs of up to 59 (e.g. ((ab{0,59})+a)|a+ ). I just used up to 4 to make a functional test case. – Eric Cler Sep 20 '22 at 18:05
0

enter image description here

WITH data AS (
SELECT * FROM VALUES 
 ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
 (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
 (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
 (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data  
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
 all rows per match  with unmatched rows
AFTER MATCH SKIP  PAST LAST ROW
PATTERN ( ((a+ b{1,4})+ a) | a+ )
DEFINE
    a as column2 = 'a',
    b as column2 = 'b'
) ORDER BY 1;
Adrian White
  • 1,720
  • 12
  • 14
  • Well, this modification does appear to work! Do you have an understanding why this one would work and the former would not? (For what it's worth, I think someone from Snowflake is looking into this.) – Eric Cler Sep 26 '22 at 12:48
  • Short = Regex typo. Just added '+' to the first a. Long = Your original regex enforced a single 'a' at the start of the match and unlimited 'a's at the end. When there was a single 'a' between the b's it was forced to be part of the previous match (as you skip past last row). a+ b{1,4})+ a vs a b{1,4})+ a. Just needed fresh eyes :-) – Adrian White Sep 26 '22 at 23:54
  • Please stop the Snowflake person -> we don't really want them tinkering with things - this was simply a typo in the regex. – Adrian White Sep 26 '22 at 23:56