Snowflake match_recognize not including all instances of a given pattern using quantifiers

Question

I'm trying to use Snowflake's match_recognize tool to match a particular pattern across rows. The pattern consists of any sequence of a's and b's, provided that long runs of b's are excluded. In my test case, I want to allow runs of up to 4 b's to be included in the match.

Using the handy https://regexr.com/?2tp0k website, I was able to build the desired regexp:

((ab{0,4})+a)|a+

Applying it to this string:

baabbbaaaaaaaababbabbabbabbbabbbab

I get this one match (in bold), which I am happy with:

baabbbaaaaaaaababbabbabbabbbabbbab

As desired, this is absorbing into the match any run of b's that is 4 or shorter. (It doesn't pick up b at the beginning of the string or the b at the end, but that is expected.) Also note that while it doesn't contain any long runs of b's, there are a bunch of b's spread throughout that match.

For some reason, when I use this regular expression with Snowflake's match_recognize pattern, it doesn't take up all of the short runs of b's.

Hence, instead of matching the entire sequence matched above, it matches these:

baabbbaaaaaaaababbabbabbabbbabbbab

Any suggestions?

Here's the query that illustrates the result:

WITH data AS (
    SELECT * FROM VALUES 
     ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
     (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
     (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
     (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data 
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a b{0,4})+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;

Resulting in this result. Rows 25-27 are not included in the match, and a new match is started at row 28.

Image of results

score 1 · Answer 1 · answered Sep 20 '22 at 17:00

Interestingly enough when changing pattern from ((ab{0,4})+a)|a+ to ( ((a | ab | abb | abbb | abbbb)+ a) | a+ ) it produces:

WITH data AS (
    SELECT * FROM VALUES 
     ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
     (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
     (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
     (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data 
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a | ab | abb | abbb | abbbb)+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;

Output:

That is interesting, though in my actual case, I'm wanting to include runs of up to 59 (e.g. ((ab{0,59})+a)|a+ ). I just used up to 4 to make a functional test case. — Eric Cler, Sep 20 '22 at 18:05

score 0 · Accepted Answer · answered Sep 26 '22 at 08:04

0

WITH data AS (
SELECT * FROM VALUES 
 ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
 (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
 (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
 (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data  
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
 all rows per match  with unmatched rows
AFTER MATCH SKIP  PAST LAST ROW
PATTERN ( ((a+ b{1,4})+ a) | a+ )
DEFINE
    a as column2 = 'a',
    b as column2 = 'b'
) ORDER BY 1;

answered Sep 26 '22 at 08:04

Adrian White

1,720
12
14

Well, this modification does appear to work! Do you have an understanding why this one would work and the former would not? (For what it's worth, I think someone from Snowflake is looking into this.) – Eric Cler Sep 26 '22 at 12:48
Short = Regex typo. Just added '+' to the first a. Long = Your original regex enforced a single 'a' at the start of the match and unlimited 'a's at the end. When there was a single 'a' between the b's it was forced to be part of the previous match (as you skip past last row). a+ b{1,4})+ a vs a b{1,4})+ a. Just needed fresh eyes :-) – Adrian White Sep 26 '22 at 23:54
Please stop the Snowflake person -> we don't really want them tinkering with things - this was simply a typo in the regex. – Adrian White Sep 26 '22 at 23:56

Snowflake match_recognize not including all instances of a given pattern using quantifiers

2 Answers2