How to get specific rows for each id?

Question

To understand a business process with several statuses,

I want to get the rows with the following rules based on created_at column: 

First row of status ‘created’
Last row of ‘missing_info’ after ‘created’ (row_no 4)
First row of ‘pending’ (row_no 5)
Last row of ‘missing_info’ after ‘pending’ (row_no 7)
First row of ‘pending’ after 'missing_info' (row_no 8)
Last row of ‘successful’ (row_no 10)

Below I highlighted the rows I want to retrieve. 

Here is the example data on DB-FIDDLE

Here is the general flow: created > missing_info > pending > successful. But can be only like this too: created > successful.

I know that I can use QUALIFY with window functions and can get 'created' and 'successful' as below. But I don’t know how to get interim statuses. How can I achieve the desired output?

created AS(

SELECT *

FROM t1

WHERE status = 'created'

QUALIFY ROW_NUMBER() OVER (PARTITION BY STATUS, id ORDER BY created_at) = 1 )

Please note that created and successful are start and end statuses, so there will be only one row in the output. Others like missing_info or pending are interim statuses, so can be multiple of them in the desired output.

EDIT:

To understand a business process with several statuses,

I want to get some of the rows with the following two rules based on created_at column: 

last row of status ‘missing_info’ before ‘pending’ (row 2)
pending (row 3)
last row of status ‘missing_info’ before ‘pending’ (row 5)
pending (row 6)

Example data:

WITH t1 AS (

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:20:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'pending' AS status, '2021-07-15 11:30:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:10:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:20:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:30:00'::timestamp AS created_at

    )

SELECT *

FROM t1

Desired output:

I don't understand your rules and results. The 2nd and 4th should return the same rows. — Gordon Linoff, Jul 21 '21 at 13:30
What do you mean by the 2nd and 4th? Here is the general flow: created > missing_info > pending > successful. But can be only like this too: created > successful. As mentioned created and successful are start and end statuses. The other two are interim status. — kimi, Jul 21 '21 at 13:36
@kimi Regardless of the rules, it seems you could use [MATCH_RECOGNIZE](https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html) which is specifically designed to find patterns is set of rows. — Lukasz Szozda, Jul 21 '21 at 14:10
@LukaszSzozda Thanks for the comment! Do you have any idea how to define the pattern by using MATCH_RECOGNIZE to find the interim rows? If my question is not clear, let me know. I try to make it simpler and clearer. — kimi, Jul 21 '21 at 14:39
@kimi Sure, I have provided a simple demo, feel free to modify pattern to match youtr exact needs and extend measures part :) — Lukasz Szozda, Jul 21 '21 at 19:05

Lukasz Szozda · Accepted Answer · 2021-07-22T13:37:20.740

Snowflake implements MATCH_RECOGNIZE, which is the simplest tool for finding complex patterns in pure SQL:

Recognizes matches of a pattern in a set of rows. MATCH_RECOGNIZE accepts a set of rows (from a table, view, subquery, or other source) as input, and returns all matches for a given row pattern within this set. The pattern is defined similarly to a regular expression.

Data preparation:

CREATE OR REPLACE TABLE t
AS
WITH t1 AS (
SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:30:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:38:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:12:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 13:36:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 14:36:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'successful' AS status, '2021-07-15 16:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'successful' AS status, '2021-07-15 17:00:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-07-16 10:30:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-07-16 11:30:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'successful' AS status, '2021-07-16 12:30:00'::timestamp AS created_at
    )     
SELECT * FROM t1;

Query for scenario 1:

SELECT *
FROM t
MATCH_RECOGNIZE (
  PARTITION BY ID
  ORDER BY CREATED_AT
  -- MEASURES MATCH_NUMBER() AS m, --LAST/FIRST/CLASSIFIER/...
  ALL ROWS PER MATCH
  PATTERN (c+m+)
  DEFINE
     c AS status='created'
    ,m AS status='missing_info'
    ,p AS status='pending'
    ,s AS status='succesful'
) mr
ORDER BY ID, CREATED_AT;
-- returns rows 1-4

Key point here is pattern which is provided as Perl-style regular expression. Here we are searching for pattern of one or more "create" finished by one or more "missing_info".

ALL ROWS PER MATCH - return all rows but it could be changed to first row if necessary

MEASURES: Specifying Additional Output Columns could be used to provide additional info like MATCH_NUMBER/MATCH_SEQUENCE_NUMBER/CLASSIFIER and more depending of specific needs.

More patterns in single query could be provided by using '|' (alternative): (c+m+|pm+|...)

EDIT:

"Thanks for the answer! It returns first 4 rows. I was essentially needed 1st and 4th row."

Once groups are identified, filtering first and last row could be achieved for instance with QUALIFY. The key is to use MEASURES that I mentioned before:

SELECT *
    FROM t
    MATCH_RECOGNIZE (
      PARTITION BY ID
      ORDER BY CREATED_AT
      MEASURES MATCH_NUMBER() AS mn,
               MATCH_SEQUENCE_NUMBER AS msn
      ALL ROWS PER MATCH
      PATTERN (c+m+)
      DEFINE
         c AS status='created'
        ,m AS status='missing_info'
        ,p AS status='pending'
        ,s AS status='succesful'
    ) mr
    QUALIFY (ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn) = 1)
          OR(ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn DESC)=1)
    ORDER BY ID, CREATED_AT;
    -- returns first and last row by group consisted of ID and MATCH_NUMBER

Thanks for the answer! It returns first 4 rows. I was essentially needed 1st and 4th row. I modified the query to retrieve the correct rows, but still no success.  I simplified the question. Do you mind if you can have a look again? I simply want to get rows of 2, 3, 5 and 6. — kimi, Jul 22 '21 at 10:12
@kimi That's why I have mentioned MEASURES. Once group is identified, finding first and last row could be accomplished with `QUALIFY`. I have also reverted the question to its original form and added updated section. Please do not make in-place updates as they may invalidate already provided input. — Lukasz Szozda, Jul 22 '21 at 13:36

How to get specific rows for each id?

EDIT:

1 Answers1