1

I have data in a table with the following structure where all id's are unique and there are many different start and end combinations, with the difference between start and end ranging from less than a minute up to the hundreds of minutes. I only need to split out the duration for start/end differences greater than 60 minutes long.

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 |

and I want to create a table that will split out the duration by hour with the id of the original entry attached, like below:

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
| 0abc23 | 2019-06-29 01:08:00.000 | 2019-06-29 02:07:59.000 |
| 0abc23 | 2019-06-29 02:08:00.000 | 2019-06-29 03:07:59.000 |
| 0abc23 | 2019-06-29 03:08:00.000 | 2019-06-29 04:07:59.000 |
| 0abc23 | 2019-06-29 04:08:00.000 | 2019-06-29 05:18:00.000 |

I have tried the following code from a similar question for a different database engine:

with table1 as (
      select id
          , duration_start as fromdate
          , duration_end as todate
      from test_data
),
tab1 as (
      select id
        , fromdate
        , dateadd(second, 60*60 - 1, fromdate) as todate1 // add an hour minus a second
        , todate
      from table1

      union all

      select id
        , dateadd(minutes, 60, todate1) as fromdate // add an hour to todate1
        , dateadd(second, 2*60*60-1, todate1) as todate1 // add 1 hours to todate1
        , todate
      from tab1 where date_trunc(hour, todate1) < date_trunc(hour, todate)
),
tab2 as (
      select id
        , fromdate
        , case when todate1 > todate then todate else todate1 end as todate
      from tab1

but that gives me

| ID     | DURATION_START          | DURATION_END            |
|--------|-------------------------|-------------------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
| 0abc23 | 2019-06-29 02:07:59.000 | 2019-06-29 03:07:58.000 |
| 0abc23 | 2019-06-29 04:07:58.000 | 2019-06-29 05:07:57.000 |
| 0abc23 | 2019-06-29 06:07:57.000 | 2019-06-29 07:07:56.000 |
| 0abc23 | 2019-06-29 08:07:56.000 | 2019-06-29 09:07:55.000 |

I've tried adjusting the amount of time added but haven't been able to get a 1 hour offset between the start and end. I've also attempted to use connect by and level (suggested for similar question but on oracle) but haven't been able to get that to compile. I just get a invalid identifier error for LEVEL.

select dateadd(hour,24,duration_start)
      , greatest(duration_start, date_trunc(hour,dateadd(hour,((level-1)::int/24)::int,duration_start)))
      , least(duration_start, date_trunc(hour,dateadd(hour,((level)::int/24)::int,duration_start)))
from test_data
connect by prior LEVEL = floor(datediff(hour, duration_start, duration_end)::int*24)+1;

Even just trying to use connect by gives me an error:

select *
from test_data
connect by prior LEVEL = floor(datediff(hour, duration_start, duration_end)::int*24)+1;

error: SQL compilation error: error line 0 at position -1 invalid identifier 'HOUR'

If anyone can tell me how to adjust my approach for snowflake, that would be much appreciated, thanks!

jarlh
  • 42,561
  • 8
  • 45
  • 63
thleo
  • 742
  • 2
  • 8
  • 21

2 Answers2

1

Doing that in pure SQL can be done with various tricks, but IMO by far the easiest is to create a short JavaScript UDTF for this.

Full example with some additional rows for testing below. Note that some corner cases might need to be additionally covered, e.g. what if your end time is before start time, or if one of them is null. But this should help.

create or replace table x(
  id varchar, 
  duration_start timestamp_ntz,
  duration_end timestamp_ntz
) as 
select * from values
  ('0abc23', '2019-06-29 00:08:00.000', '2019-06-29 09:18:00.000'),
  ('id_2__', '2002-02-02 02:00:00.000', '2002-02-02 02:00:00.000'),
  ('id_3__', '2003-03-03 03:00:00.000', '2003-03-03 04:00:00.000'),
  ('id_4__', '2004-04-04 04:00:00.000', '2004-04-04 04:59:59.000');

create or replace function magic(
  id varchar, 
  duration_start timestamp_ntz,
  duration_end timestamp_ntz
) 
returns table (
  hour_start timestamp_ntz, 
  hour_end timestamp_ntz
) language javascript as 
$$
{
  processRow: function(row, rowWriter, context) {
    let msSecond = 1000;
    let msHour = 60 * 60 * msSecond;

    let msStart = row.DURATION_START.getTime();
    let msEnd = row.DURATION_END.getTime();

    while (msStart <= msEnd) {
      let curEnd = Math.min(msEnd, msStart + msHour - msSecond);
      rowWriter.writeRow({
        HOUR_START: new Date(msStart),
        HOUR_END: new Date(curEnd)
      });
      msStart = curEnd + msSecond;
    }
  }
}
$$;

select * from x, table(magic(id, duration_start, duration_end)) ;
--------+-------------------------+-------------------------+-------------------------+-------------------------+
   ID   |     DURATION_START      |      DURATION_END       |       HOUR_START        |        HOUR_END         |
--------+-------------------------+-------------------------+-------------------------+-------------------------+
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:08:00.000 | 2019-06-29 01:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 01:08:00.000 | 2019-06-29 02:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 02:08:00.000 | 2019-06-29 03:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 03:08:00.000 | 2019-06-29 04:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 04:08:00.000 | 2019-06-29 05:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 05:08:00.000 | 2019-06-29 06:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 06:08:00.000 | 2019-06-29 07:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 07:08:00.000 | 2019-06-29 08:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 08:08:00.000 | 2019-06-29 09:07:59.000 |
 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 09:08:00.000 | 2019-06-29 09:18:00.000 |
 id_2__ | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 | 2002-02-02 02:00:00.000 |
 id_3__ | 2003-03-03 03:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 03:00:00.000 | 2003-03-03 03:59:59.000 |
 id_3__ | 2003-03-03 03:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 04:00:00.000 | 2003-03-03 04:00:00.000 |
 id_4__ | 2004-04-04 04:00:00.000 | 2004-04-04 04:59:59.000 | 2004-04-04 04:00:00.000 | 2004-04-04 04:59:59.000 |
--------+-------------------------+-------------------------+-------------------------+-------------------------+

Note that the results include the original start/end, but they don't have to.

Marcin Zukowski
  • 4,281
  • 1
  • 19
  • 28
  • What is the efficiency of using a JS UDTF for this vs sql? – thleo Aug 02 '19 at 18:55
  • Depends. For simple operations, it's slower. But if you have logic that is hard or tricky in SQL, it will often be much faster (and easier to implement). In this case, since I'm not sure how to do it in pure SQL, hard to be certain. – Marcin Zukowski Aug 04 '19 at 05:38
1

For this type of project it helps to have separate date & interval tables.

here is the script to achieve what you need.

create a calendar table with whatever date range you need.

I start with '1970-01-01' the unix epoch, as I keep a calendar table handy for situations like this.

create or replace table calendar(calendar_date DATE)

insert into calendar(calendar_date)
select
dateadd(d,rid2,to_timestamp_ntz('1970-01-01')) as calendar_date
from
(
select 0 as rid2 union
select row_number() over (order by null) as rid2
from table (generator(rowcount => 100000))
) t
where dateadd(d,rid2,to_timestamp_ntz('1970-01-01')) < '2030-01-01'
order by 1 asc;

Next create a time / interval table. The interval table. In your example your interval duration is 1 hour. So the following was created.

create or replace table interval(id integer, interval_start time);

insert into interval (id,interval_start)
select
id,
to_time(dateadd(hour,id,to_timestamp_ntz('1970-01-01')))
from
(
select 0 as id union
select row_number() over (order by null) as id
from table (generator(rowcount => 23))
) t;

Next i created a table with your sample data, as well as a couple other values so that the calculations can be validated across different scenarios.

create or replace table example1(id varchar(10), DURATION_START datetime, DURATION_END datetime);
-- drop table example1
truncate table example1;
--
insert into  example1 values('0abc23','2019-06-29 00:08:00.000','2019-06-29 09:18:00.000');
insert into  example1 values('0abc24','2019-06-28 11:07:45.000','2019-06-28  12:08:45.000');
insert into  example1 values('0abc25','2019-06-28 01:00:00.000','2019-06-29 02:15:00.000');
insert into  example1 values('0abc26','2019-06-28 00:08:00.000','2019-06-29 15:18:00.000');

Given that everything is setup, the query below will give you the result you need.

select 
      f.id
     ,f.DURATION_START
     ,f.DURATION_END
     ,f.start_time_HOUR_START
     ,f.end_time_HOUR_START
     ,q.CALENDAR_DATE
     ,q.HOUR_START
     ,q.HOUR_END
     ,case
    -- when starts during interval and ends after interval
           when f.DURATION_START >= q.HOUR_START and f.DURATION_END >= dateadd(hour, 1, q.HOUR_START)
               then datediff(s, f.DURATION_START, dateadd(hour, 1, q.HOUR_START))
    -- when starts during interval and ends during interval
           when f.DURATION_START >= q.HOUR_START and f.DURATION_END <= dateadd(hour, 1, q.HOUR_START)
               then datediff(s, f.DURATION_START, f.DURATION_END)
    -- when starts before interval and ends during interval
           when f.DURATION_START <= q.HOUR_START and f.DURATION_END <= dateadd(hour, 1, q.HOUR_START)
               then datediff(s, q.HOUR_START, f.DURATION_END)
    -- entire interval , starts before, and ends after
           when
               f.DURATION_START <= q.HOUR_START and f.DURATION_END >= dateadd(hour, 1, q.HOUR_START)
               then datediff(s, q.HOUR_START, dateadd(hour, 1, q.HOUR_START))
           else 0 end as seconds_elapsed
from (
         select *
              , to_timestamp(
                 dateadd(s, datediff(s, '1970-01-01', DURATION_START) - (datediff(s, '1970-01-01', DURATION_START) % 3600),
                         '1970-01-01')) as start_time_HOUR_START
              , to_timestamp(
                 dateadd(s, datediff(s, '1970-01-01', DURATION_END) - (datediff(s, '1970-01-01', DURATION_END) % 3600),
                         '1970-01-01')) as end_time_HOUR_START
         from example1
     ) f
         inner join

     (
         select
                distinct
                q1.calendar_date
              -- , t2.rid2
              , dateadd(hour, t2.id, to_timestamp(q1.calendar_date)) as HOUR_START
              , dateadd(hour, t2.id + 1, to_timestamp(q1.calendar_date)) as HOUR_END
         from (
                  select calendar_date
                  from calendar
                  where calendar_date between (select to_date(min(DURATION_START)) from example1) and (select to_date(max(DURATION_END)) from example1)
              ) q1
                  cross join
              interval as t2
        -- order by HOUR_START
     ) q on q.HOUR_START between f.start_time_HOUR_START and f.end_time_HOUR_START

ORDER BY f.id
       , f.DURATION_START
       , f.DURATION_END
       , q.CALENDAR_DATE
       , q.HOUR_START
;

Sample output below. run the script to get the final result:

+--------+-------------------------+-------------------------+-------------------------+-------------------------+---------------+-------------------------+-------------------------+-----------------+
| ID     | DURATION_START          | DURATION_END            | START_TIME_HOUR_START   | END_TIME_HOUR_START     | CALENDAR_DATE | HOUR_START              | HOUR_END                | SECONDS_ELAPSED |
|--------+-------------------------+-------------------------+-------------------------+-------------------------+---------------+-------------------------+-------------------------+-----------------|
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 00:00:00.000 | 2019-06-29 01:00:00.000 |            3120 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 01:00:00.000 | 2019-06-29 02:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 02:00:00.000 | 2019-06-29 03:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 03:00:00.000 | 2019-06-29 04:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 04:00:00.000 | 2019-06-29 05:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 05:00:00.000 | 2019-06-29 06:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 06:00:00.000 | 2019-06-29 07:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 07:00:00.000 | 2019-06-29 08:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 08:00:00.000 | 2019-06-29 09:00:00.000 |            3600 |
| 0abc23 | 2019-06-29 00:08:00.000 | 2019-06-29 09:18:00.000 | 2019-06-29 00:00:00.000 | 2019-06-29 09:00:00.000 | 2019-06-29    | 2019-06-29 09:00:00.000 | 2019-06-29 10:00:00.000 |            1080 |
| 0abc24 | 2019-06-28 11:07:45.000 | 2019-06-28 12:08:45.000 | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 | 2019-06-28    | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 |            3135 |
| 0abc24 | 2019-06-28 11:07:45.000 | 2019-06-28 12:08:45.000 | 2019-06-28 11:00:00.000 | 2019-06-28 12:00:00.000 | 2019-06-28    | 2019-06-28 12:00:00.000 | 2019-06-28 13:00:00.000 |             525 |

For a link to the code click

john.da.costa
  • 4,682
  • 4
  • 29
  • 30