3

Update: This was answered here.

I am putting together a somewhat complex query to do event detection, join(s), and time-based binning with a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me eloquently detect time-series events, but whenever I try to use a match_recognize expression within a Common Table Expression (with .. as ..), I receive the following error:

SQL compilation error: MATCH_RECOGNIZE not supported in this context.

I've done a lot of searching/reading, but haven't found any documented limitations on match_recognize in CTEs. Here's my query:

with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
),

label_events as (
    select *
    from clean_data
        match_recognize (
            partition by id
            order by timestamp
            measures
                match_number() as event_number
            all rows per match
            after match skip past last row
            pattern(any_row row_between_gaps+)
            define
                -- Classify contiguous sections of datapoints with < 20min between adjacent points.
                row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
        )
)

-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;

And I get the same error as above with this.

Is this a limitation that I'm not seeing, or am I doing something wrong?

natador
  • 51
  • 5

2 Answers2

2

Non-recursive cte could be always rewritten as inline view:

--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
     from dataset) clean_data
match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            -- Classify contiguous sections of datapoints with < 20min between adjacent points.
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    )mr
-- ) -- if other transformations are required

It is not ideal, but at least it will allow query to run.

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • True; thanks for the answer. This is the solution I'm using currently, but I'm hoping to avoid it for the sake of code readability. What I want to do after the `match_recognize` step involves some branching computations with a join at the end. Any ideas for that? – natador Jun 29 '21 at 18:29
  • 1
    @natador Depends on the complexity of cte. But you could always try to materialize part of cte as temporary table and based on that build the rest of the logic. – Lukasz Szozda Jun 29 '21 at 18:40
  • 2
    See also https://stackoverflow.com/a/68550312/132438 – Felipe Hoffa Jul 27 '21 at 18:49
2

Per this thread from a comment by Filipe Hoffa: MATCH_RECOGNIZE with CTE in Snowflake

This seemed to be a non-documented limitation of Snowflake at the time. A two or three step solution has worked well for me:

with clean_data as (
    -- Remove duplicate entries
    select distinct id, timestamp, measurement
    from dataset
)

select *
from clean_data
    match_recognize (
        partition by id
        order by timestamp
        measures
            match_number() as event_number
        all rows per match
        after match skip past last row
        pattern(any_row row_between_gaps+)
        define
            -- Classify contiguous sections of datapoints with < 20min between adjacent points.
            row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
    );

set quid=last_query_id();

with label_events as (
    select *
    from table(result_scan($quid))
)

-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;

I prefer to use a variable here, because I can re-run the second query multiple times during development/debugging without having to re-run the first query.

It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GEOJSON, so when retrieving these with result_scan, you must typecast them back to the GEOGRAPHY type.

natador
  • 51
  • 5