0

In Redshift, through SQL script want to consolidate monthly records as long as gap between the end date of first and the start date of the next record is 32 days or less (<=32) into single record with minimum startdate of continuous month as output startdate and maximum of end date of continuous month as output enddate.

The below input data refers to the table's data and also listed the expected output. The input data is listed ORDER BY ID,STARTDT,ENDDT in ASC.

For example, in below table, consider ID 100, the gab between the end of the first record and start of the next record <=32, however gap between the second record end date and third records start date falls more than 32 days, hence the first two records to be consolidate into one record i.e. (ID),MIN(STARTSDT),MAX(ENDDT) which corresponds to first record in the expected output. Similarly gab between 3 and 4 record in the input data falls within the 32 days and thus these 2 records to be consolidated into single records which corresponds to the second record in the expected output.

INPUT DATA:

ID STARTDT ENDDT
100 2000-01-01 2000-01-31
100 2000-02-01 2000-02-29
100 2000-05-01 2000-05-31
100 2000-06-01 2000-06-30
100 2000-09-01 2000-09-30
100 2000-10-01 2000-10-31
101 2012-06-01 2012-06-30
101 2012-07-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31

EXPECTED OUTPUT:

ID MIN_STARTDT MAX_END_DT
100 2000-01-01 2000-02-29
100 2000-05-01 2000-06-30
100 2000-09-01 2000-10-31
101 2012-06-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31
jarlh
  • 42,561
  • 8
  • 45
  • 63

2 Answers2

0

You can do this in steps:

  • Use a join to identify where two adjacent records should be combined.
  • Then do a cumulative sum to assign all such adjacent records a grouping identifier.
  • Aggregate.

It looks like:

  select id, min(startdt), max(enddte)
  from (select t.*,
               count(case when tprev.id is null then 1 else 0 end) over 
                     (partition by t.idid
                      order by t.startdt
                      rows between unbounded preceding and current row
                     ) as grp
        from t left join
             t tprev
             on t.id = tprev.id and
                t.startdt = tprev.enddt + interval '1 day'
       ) t
  group by id, grp;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

The question is very similar to this one and my answer is also similar: Fetch rows based on condition

The gist of the idea is to use Window Functions to identify transitions between period (events which are less than 33 days apart), and then do some filtering to remove the rows within the period, and then Window Functions again.

Complete solution:

SELECT
  id,
  startdt AS period_start,
  period_end
FROM (
  SELECT
    id,
    startdt,
    enddt,
    lead(enddt, 1)
    OVER (PARTITION BY id
      ORDER BY enddt) AS period_end,
    period_boundary
  FROM (
         SELECT
           id,
           startdt,
           enddt,
           CASE WHEN period_switch = 0 AND reverse_period_switch = 1
             THEN 'start'
           ELSE 'end' END AS period_boundary
         FROM (
                SELECT
                  id,
                  startdt,
                  enddt,
                  CASE WHEN datediff(days, enddt, lead(startdt, 1)
                  OVER (PARTITION BY id
                    ORDER BY enddt ASC)) > 32
                    THEN 1
                  ELSE 0 END AS period_switch,
                  CASE WHEN datediff(days, lead(enddt, 1)
                  OVER (PARTITION BY id
                    ORDER BY enddt DESC), startdt) > 32
                    THEN 1
                  ELSE 0 END AS reverse_period_switch
                FROM date_test
              )
           AS sessioned
         WHERE period_switch != 0 OR reverse_period_switch != 0
         UNION
         SELECT -- adding start rows without transition
           id,
           startdt,
           enddt,
           'start'
         FROM (
                SELECT
                  id,
                  startdt,
                  enddt,
                  row_number()
                  OVER (PARTITION BY id
                    ORDER BY enddt ASC) AS row_num
                FROM date_test
              ) AS with_row_number
         WHERE row_num = 1
         UNION
         SELECT -- adding end rows without transition
           id,
           startdt,
           enddt,
           'end'
         FROM (
                SELECT
                  id,
                  startdt,
                  enddt,
                  row_number()
                  OVER (PARTITION BY id
                    ORDER BY enddt desc) AS row_num
                FROM date_test
              ) AS with_row_number
         WHERE row_num = 1
       ) AS with_boundary -- data set containing start/end boundaries
) AS with_end -- data set where end date is propagated into the start row of the period
WHERE period_boundary = 'start'
ORDER BY id, startdt ASC;

Note that in your expected output, you had a row for 103 2013-05-01 2013-05-31, however its start date is 31 days apart from end date of the previous row, so this row should instead be merged with the previous row for id 103 according to your requirements.

So the output that I get looks like this:

 id    start       end
100  2000-01-01  2000-02-29
100  2000-05-01  2000-06-30
100  2000-09-01  2000-10-31
101  2012-06-01  2012-07-31
102  2000-01-01  2000-01-31
103  2013-03-01  2013-05-31
Community
  • 1
  • 1
denismo
  • 760
  • 4
  • 8