
Currently, I have the following raw data in Redshift:

timestamp                   ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...

So, I would like to generate aggregated data with an interval of 30 minutes. My desired outcome is:

timestamp                   ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...

I referred to https://stackoverflow.com/a/12046382/3238864, which is valid for PostgreSQL.

I tried to mimic the posted code by using:

with thirty_min_intervals as (
    select
      (select min(timestamp)::date from events) + ( n    || ' minutes')::interval start_time,
      (select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
    from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;

However, I'm getting the error:

[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.

What is a good way to perform this kind of aggregated data calculation, over intervals of N minutes, in Redshift?

Cheok Yan Cheng
  • Amazon Redshift does not support the `generate_series()` function. [Refer to this list of unsupported PostgreSQL features](http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html) – Viki888 Feb 06 '17 at 08:23
  • But if you run the bare command `select * from generate_series(0, (24*60), 30) n;` in Redshift, it runs OK. – Cheok Yan Cheng Feb 06 '17 at 08:54
  • Yes. `generate_series` runs on the parent (leader) node. If you try to access a Redshift table in a query that uses `generate_series`, it will throw this error, since the child (compute) nodes do not support the `generate_series()` function. If your query does not access Redshift tables, then `generate_series()` will give you a result. – Viki888 Feb 06 '17 at 09:09
  • One option is to create a table that has a list of times in 30-minute intervals, then join against it. I think that `generate_series()` can be used to create the table. – John Rotenstein Feb 06 '17 at 09:43
  • @JohnRotenstein You cannot create a table using `generate_series()`. Redshift does not support it. You have to create such a table with multiple select queries. – Viki888 Feb 07 '17 at 08:01

2 Answers


Joe's answer is a neat solution. I feel one should always consider how the data is distributed and sorted when working in Redshift; it can have a dramatic impact on performance.

Building on Joe's answer: I will materialise the sample events. In practice the events will already be in a table.

drop table if exists public.temporary_events;
create table public.temporary_events AS 
select ts::timestamp as ts 
    ,lead 
from 
(   SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;

Now run explain:

explain 
WITH time_dimension
AS (SELECT  dtm
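           /* Round dtm down to the previous half hour by subtracting the
              seconds elapsed since the last :00 or :30 boundary */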
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)

SELECT dtm_half_hour
      ,COUNT(CASE WHEN lead THEN 1 END)
FROM      time_dimension td
LEFT JOIN public.temporary_events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 omitted just to simplify the job a little

The output is:

XN HashAggregate  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
  ->  XN Hash Left Join DS_DIST_BOTH  (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
        Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
        Inner Dist Key: e."ts"
        Hash Cond: ("outer"."?column2?" = "inner"."ts")
        ->  XN Subquery Scan rn  (cost=0.00..14.95 rows=1 width=8)
              Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
              ->  XN Limit  (cost=0.00..1.95 rows=130 width=0)
                    ->  XN Window  (cost=0.00..1.95 rows=130 width=0)
                          ->  XN Network  (cost=0.00..1.30 rows=130 width=0)
                                Send to slice 0
                                ->  XN Seq Scan on stl_scan  (cost=0.00..1.30 rows=130 width=0)
        ->  XN Hash  (cost=0.04..0.04 rows=4 width=9)
              ->  XN Seq Scan on temporary_events e  (cost=0.00..0.04 rows=4 width=9)

Kablamo!

As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large or your SQL logic gets complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic to your code.

Three areas we can look at:

  1. The Join. Make the join between both sets of data work on plain columns of the same datatype. In the query above, one side of the join is an expression computed from a timestamp and an interval rather than a plain column.
  2. Data distribution. Materialise and distribute both tables by timestamp.
  3. Data sorting. If events is sorted by its timestamp and the time dimension is sorted by both timestamps, then you may be able to complete the entire query using a merge join, without any data moving and without sending the data to the leader node for aggregation.

Observe:

drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT  dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm         
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;               

drop table if exists public.temporary_events;
create table public.temporary_events 
distkey(ts) sortkey(ts)
AS 
select ts::timestamp as ts 
    ,lead 
from 
(   SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;

explain 
SELECT 
     dtm_half_hour
    ,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour

This then gives:

XN HashAggregate  (cost=1512.67..1512.68 rows=1 width=9)
  ->  XN Merge Left Join DS_DIST_NONE  (cost=0.00..1504.26 rows=1682 width=9)
        Merge Cond: ("outer".dtm = "inner"."ts")
        ->  XN Seq Scan on temporary_time_dimension td  (cost=0.00..1500.00 rows=1682 width=16)
              Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
        ->  XN Seq Scan on temporary_events e  (cost=0.00..0.04 rows=4 width=9)

Important caveats:

  • I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with it!
  • I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
  • I think you will likely want to have a time dimension created with diststyle all and sorted. This will ensure that your joins do not generate network traffic. A sketch of such a table follows below.
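
A minimal sketch of such a time dimension (reusing the column logic from the tables above; `public.time_dimension` is an illustrative name and the statement is untested):

drop table if exists public.time_dimension;
create table public.time_dimension
diststyle all   /* a full copy on every node, so joins against it need no redistribution */
sortkey(dtm, dtm_half_hour)
AS (SELECT  dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;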
hibernado

You can use ROW_NUMBER() to generate a series. I use internal tables that I know to be large. FWIW, I would typically persist the time_dimension to a real table to avoid doing this repeatedly (a sketch of that is shown after the query below).

Here you go:

WITH events
AS (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean  as lead
    UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)

,time_dimension
AS (SELECT  dtm
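           /* Round dtm down to the previous half hour by subtracting the
              seconds elapsed since the last :00 or :30 boundary */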
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
         /*  NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM /* Generate a number sequence of 100,000 values from a large internal table */
               (SELECT  ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)

SELECT dtm_half_hour
      ,COUNT(CASE WHEN lead THEN 1 END)
FROM      time_dimension td
LEFT JOIN events e
       ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
ORDER BY 1
;
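
To avoid rebuilding the series each run, the time_dimension CTE above could be persisted with something like this (a sketch only; `public.time_dimension` is an illustrative name, and the other answer shows a variant with explicit distribution and sort keys):

/* Illustrative: persist the time dimension once instead of recomputing it per query */
create table public.time_dimension AS
SELECT  dtm
       ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
      FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn
;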
Joe Harris
  • On Redshift it is mandatory to put a derived table like time_dimension into a real table. Because the 'time_dimension' is made from a couple of nested statements, the query optimiser does not know how the data is sorted and distributed across the database nodes. It will assume the worst and boink. – hibernado Nov 16 '18 at 21:04
  • @hibernado Can you please clarify your comment? I use the pattern above *daily* without any issue. – Joe Harris Nov 26 '18 at 14:27
  • I've added an 'answer' to elaborate. – hibernado Nov 27 '18 at 19:09