1

tl;dr: I want to generate a dates table in Redshift in order to make a report easier to generate. Preferable without needing large tables already in redshift, needing to upload a csv file.

long version: I am working on a report where I have to average new items created per day of the week. The date range could span months or more, so there could be, say, 5 Mondays but only 4 Sundays, which can make the math a little tricky. Also, I am not guaranteed an instance of a single item per day, especially once a user starts slicing the data. Which, this is tripping up the BI tool.

The best way to tackle this problem is most likely a dates table. However, most of the tutorials for dates tables use SQL commands that are not available or not fully supported by Redshift (I'm looking at you, generate_series).

Is there an easy way to generate a dates table in Redshift?

The code I was attempting to use: (based on this also-not-working recommendation: http://elliot.land/post/building-a-date-dimension-table-in-redshift )

CREATE TABLE facts.dates (
  "date_id"              INTEGER                     NOT NULL PRIMARY KEY,

  -- DATE
  "full_date"            DATE                        NOT NULL,

  -- YEAR
  "year_number"          SMALLINT                    NOT NULL,
  "year_week_number"     SMALLINT                    NOT NULL,
  "year_day_number"      SMALLINT                    NOT NULL,

  -- QUARTER
  "qtr_number"           SMALLINT                    NOT NULL,

  -- MONTH
  "month_number"         SMALLINT                    NOT NULL,
  "month_name"           CHAR(9)                     NOT NULL,
  "month_day_number"     SMALLINT                    NOT NULL,

  -- WEEK
  "week_day_number"      SMALLINT                    NOT NULL,

  -- DAY
  "day_name"             CHAR(9)                     NOT NULL,
  "day_is_weekday"       SMALLINT                    NOT NULL,
  "day_is_last_of_month" SMALLINT                    NOT NULL
) DISTSTYLE ALL SORTKEY (date_id)
;


INSERT INTO facts.dates
(
   "date_id"
  ,"full_date"
  ,"year_number"
  ,"year_week_number"
  ,"year_day_number"

  -- QUARTER
  ,"qtr_number"

  -- MONTH
  ,"month_number"
  ,"month_name"
  ,"month_day_number"

  -- WEEK
  ,"week_day_number"

  -- DAY
  ,"day_name"
  ,"day_is_weekday"
  ,"day_is_last_of_month"
)
  SELECT
    cast(seq + 1 AS INTEGER)                                      AS date_id,

    -- DATE
    datum                                                         AS full_date,

    -- YEAR
    cast(extract(YEAR FROM datum) AS SMALLINT)                    AS year_number,
    cast(extract(WEEK FROM datum) AS SMALLINT)                    AS year_week_number,
    cast(extract(DOY FROM datum) AS SMALLINT)                     AS year_day_number,

    -- QUARTER
    cast(to_char(datum, 'Q') AS SMALLINT)                         AS qtr_number,

    -- MONTH
    cast(extract(MONTH FROM datum) AS SMALLINT)                   AS month_number,
    to_char(datum, 'Month')                                       AS month_name,
    cast(extract(DAY FROM datum) AS SMALLINT)                     AS month_day_number,

    -- WEEK
    cast(to_char(datum, 'D') AS SMALLINT)                         AS week_day_number,

    -- DAY
    to_char(datum, 'Day')                                         AS day_name,
    CASE WHEN to_char(datum, 'D') IN ('1', '7')
      THEN 0
    ELSE 1 END                                                    AS day_is_weekday,
    CASE WHEN
      extract(DAY FROM (datum + (1 - extract(DAY FROM datum)) :: INTEGER +
                        INTERVAL '1' MONTH) :: DATE -
                       INTERVAL '1' DAY) = extract(DAY FROM datum)
      THEN 1
    ELSE 0 END                                                    AS day_is_last_of_month
  FROM
    -- Generate days for 81 years starting from 2000.
    (
      SELECT
        '2000-01-01' :: DATE + generate_series AS datum,
        generate_series                        AS seq
      FROM generate_series(0,81 * 365 + 20,1)
    ) DQ
  ORDER BY 1;

Which throws this error

[Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
1 statement failed.

... because, I assume, INSERT and generate_series are not allowed in the same command in Redshift

Phillip
  • 171
  • 1
  • 2
  • 9
  • 1
    As you've discovered, `generate_series()` cannot be used with actual data because it executes only on the leader node. Your method of generating a numbers table and then joining to it works well. Alternatively, creating a source file in Excel and just importing the results. A dates table like that works very well for reports. Other things you might want to add: public holiday flag, last day of quarter flag, last day of year flag (handy for reports grouped by last date of the period). – John Rotenstein Nov 11 '17 at 07:23
  • I like those additional columns. Thanks John! – Phillip Nov 13 '17 at 14:34

5 Answers5

3

In asking the question, I figured it out. Oops.

I started with a "facts" schema.

CREATE SCHEMA facts;

Run the following to start a numbers table:

create table facts.numbers
(
  number int PRIMARY KEY
)
;

Use this to generate your number list. I used a million to get started

SELECT ',(' || generate_series(0,1000000,1) || ')'
;

Then copy-paste the numbers from your results in the query below, after VALUES:

INSERT INTO facts.numbers
VALUES
 (0)
,(1)
,(2)
,(3)
,(4)
,(5)
,(6)
,(7)
,(8)
,(9)
-- etc

^ Make sure to remove the leading comma from the copy-pasted list of numbers

Once you have a numbers table, then you can generate a dates table (again, stealing code from elliot land http://elliot.land/post/building-a-date-dimension-table-in-redshift ) :

CREATE TABLE facts.dates (
  "date_id"              INTEGER                     NOT NULL PRIMARY KEY,

  -- DATE
  "full_date"            DATE                        NOT NULL,

  -- YEAR
  "year_number"          SMALLINT                    NOT NULL,
  "year_week_number"     SMALLINT                    NOT NULL,
  "year_day_number"      SMALLINT                    NOT NULL,

  -- QUARTER
  "qtr_number"           SMALLINT                    NOT NULL,

  -- MONTH
  "month_number"         SMALLINT                    NOT NULL,
  "month_name"           CHAR(9)                     NOT NULL,
  "month_day_number"     SMALLINT                    NOT NULL,

  -- WEEK
  "week_day_number"      SMALLINT                    NOT NULL,

  -- DAY
  "day_name"             CHAR(9)                     NOT NULL,
  "day_is_weekday"       SMALLINT                    NOT NULL,
  "day_is_last_of_month" SMALLINT                    NOT NULL
) DISTSTYLE ALL SORTKEY (date_id)
;


INSERT INTO facts.dates
(
   "date_id"
  ,"full_date"
  ,"year_number"
  ,"year_week_number"
  ,"year_day_number"

  -- QUARTER
  ,"qtr_number"

  -- MONTH
  ,"month_number"
  ,"month_name"
  ,"month_day_number"

  -- WEEK
  ,"week_day_number"

  -- DAY
  ,"day_name"
  ,"day_is_weekday"
  ,"day_is_last_of_month"
)
  SELECT
    cast(seq + 1 AS INTEGER)                                      AS date_id,

    -- DATE
    datum                                                         AS full_date,

    -- YEAR
    cast(extract(YEAR FROM datum) AS SMALLINT)                    AS year_number,
    cast(extract(WEEK FROM datum) AS SMALLINT)                    AS year_week_number,
    cast(extract(DOY FROM datum) AS SMALLINT)                     AS year_day_number,

    -- QUARTER
    cast(to_char(datum, 'Q') AS SMALLINT)                         AS qtr_number,

    -- MONTH
    cast(extract(MONTH FROM datum) AS SMALLINT)                   AS month_number,
    to_char(datum, 'Month')                                       AS month_name,
    cast(extract(DAY FROM datum) AS SMALLINT)                     AS month_day_number,

    -- WEEK
    cast(to_char(datum, 'D') AS SMALLINT)                         AS week_day_number,

    -- DAY
    to_char(datum, 'Day')                                         AS day_name,
    CASE WHEN to_char(datum, 'D') IN ('1', '7')
      THEN 0
    ELSE 1 END                                                    AS day_is_weekday,
    CASE WHEN
      extract(DAY FROM (datum + (1 - extract(DAY FROM datum)) :: INTEGER +
                        INTERVAL '1' MONTH) :: DATE -
                       INTERVAL '1' DAY) = extract(DAY FROM datum)
      THEN 1
    ELSE 0 END                                                    AS day_is_last_of_month
  FROM
    -- Generate days for 81 years starting from 2000.
    (
      SELECT
        '2000-01-01' :: DATE + number AS datum,
        number                        AS seq
      FROM facts.numbers
      WHERE number between 0 and 81 * 365 + 20
    ) DQ
  ORDER BY 1;

^ Be sure to set the numbers at the end for the date range you need

Phillip
  • 171
  • 1
  • 2
  • 9
2

As a workaround, you can spin Postgres instance on your local machine, run the code there, export to CSV, then run CREATE TABLE portion only in Redshift and load data from CSV. Since this is a one-time operation it's ok to do, this is what I'm actually doing for new Redshift deployments.

AlexYes
  • 4,088
  • 2
  • 15
  • 23
  • Very good idea, but I figured out a way to do it without uploading a csv. Does take some copy-paste magic, unfortunately. I posted my solution below if you have any improvements. – Phillip Nov 10 '17 at 21:19
2

Here is a different suggestion for building the facts.numbers that does not require manual intervention:

  1. Take a system table (guaranteed to exist) of a known or stable size
  2. Cross join that table to itself enough times to get the desired number of rows
  3. Select the row_number() over (order by 1) to turn those created records into an ascending set of numbers

Example using the Redshift system table pg_catalog.pg_operator (which as of Oct 2020 has 659 records):

-- Prep, so that you can copy/paste the code sample
create schema if not exists facts;   -- Make sure the schema exists
drop table if exists facts.numbers;  -- Avoid an error if that table already exists;
create table facts.numbers           -- Create the table definition
(
  number int primary key
);

-- The bit you care about
insert into facts.numbers
    select     row_number() over (order by 1) -- return 1..n in place of the original record
    from       pg_catalog.pg_operator a       -- 659 records
    cross join pg_catalog.pg_operator b       -- to get 659^2=434k records 
    cross join pg_catalog.pg_operator c       -- to get 659^3=286M records
    limit      2000000                        -- to limit the result to a reasonable size
;
Sam Davey
  • 331
  • 2
  • 5
1

Extending great ideas above - small fixes for starting from 2nd day of the year instead of 1st (BI tools should not be happy with this miss) + simplification and fix for flag is_last_day_of_month:

CREATE SCHEMA IF NOT EXISTS dimensions;   -- Make sure the schema exists
DROP TABLE IF EXISTS dimensions.numbers;  -- Avoid an error if that table already exists;
CREATE TABLE dimensions.numbers           -- Create the table definition
(
  number INT PRIMARY KEY
);
-- Work around for Generate_series() and INSERT INTO by Sam Davey
INSERT INTO dimensions.numbers
    SELECT     row_number() over (order by 1) -- return 1..n in place of the original record
    FROM       pg_catalog.pg_operator a       -- 659 records
    CROSS JOIN pg_catalog.pg_operator b       -- to get 659^2=434k records
    CROSS JOIN pg_catalog.pg_operator c       -- to get 659^3=286M records
    LIMIT      1000000                        -- to limit the result to a reasonable size
; 
-- Elliot solution http://elliot.land/post/building-a-date-dimension-table-in-redshift
CREATE TABLE dimensions.dates (
  "date_id"              INTEGER                     NOT NULL PRIMARY KEY,

  -- DATE
  "full_date"            DATE                        NOT NULL,

  -- YEAR
  "year_number"          SMALLINT                    NOT NULL,
  "year_week_number"     SMALLINT                    NOT NULL,
  "year_day_number"      SMALLINT                    NOT NULL,

  -- QUARTER
  "qtr_number"           SMALLINT                    NOT NULL,

  -- MONTH
  "month_number"         SMALLINT                    NOT NULL,
  "month_name"           CHAR(9)                     NOT NULL,
  "month_day_number"     SMALLINT                    NOT NULL,

  -- WEEK
  "week_day_number"      SMALLINT                    NOT NULL,

  -- DAY
  "day_name"             CHAR(9)                     NOT NULL,
  "day_is_weekday"       SMALLINT                    NOT NULL,
  "day_is_last_of_month" SMALLINT                    NOT NULL
) DISTSTYLE ALL SORTKEY (date_id);

INSERT INTO dimensions.dates
(
   "date_id"
  ,"full_date"
  ,"year_number"
  ,"year_week_number"
  ,"year_day_number"

  -- QUARTER
  ,"qtr_number"

  -- MONTH
  ,"month_number"
  ,"month_name"
  ,"month_day_number"

  -- WEEK
  ,"week_day_number"

  -- DAY
  ,"day_name"
  ,"day_is_weekday"
  ,"day_is_last_of_month"
)
  SELECT
    CAST(seq + 0 AS INTEGER)                                      AS date_id,

    -- DATE
    datum                                                         AS full_date,

    -- YEAR
    CAST(EXTRACT(YEAR FROM datum) AS SMALLINT)                    AS year_number,
    CAST(EXTRACT(WEEK FROM datum) AS SMALLINT)                    AS year_week_number,
    CAST(EXTRACT(DOY FROM datum) AS SMALLINT)                     AS year_day_number,

    -- QUARTER
    CAST(TO_CHAR(datum, 'Q') AS SMALLINT)                         AS qtr_number,

    -- MONTH
    CAST(EXTRACT(MONTH FROM datum) AS SMALLINT)                   AS month_number,
    TO_CHAR(datum, 'Month')                                       AS month_name,
    CAST(EXTRACT(DAY FROM datum) AS SMALLINT)                     AS month_day_number,

    -- WEEK
    CAST(TO_CHAR(datum, 'D') AS SMALLINT)                         AS week_day_number,

    -- DAY
    TO_CHAR(datum, 'Day')                                         AS day_name,
    CASE WHEN TO_CHAR(datum, 'D') IN ('1', '7')
      THEN 0
    ELSE 1 END                                                    AS day_is_weekday,
    CASE WHEN LAST_DAY(datum) = datum THEN 1 ELSE 0 END           AS day_is_last_of_month
  FROM
    -- Generate days for 81 years starting from 2000.
    (
      SELECT
        ('2000-01-01' :: DATE - interval '1 day')::DATE + number AS datum,
        number                        AS seq
      FROM dimensions.numbers
      WHERE number between 0 and  81 * 365 + 20
    ) DQ
  ORDER BY 1;
DROP TABLE dimensions.numbers;
1

Can try this for a cleaner and more efficient approach.

create table facts.numbers
(number int PRIMARY KEY);

Using recursive function to generate to required amount, took about ~60 sec for 500 million rows. Need to set max_recursion_rows else redshift will stop you from doing too deep of a recursion.

SET SESSION max_recursion_rows = 500000000;
SHOW max_recursion_rows;

INSERT INTO facts.numbers
    WITH RECURSIVE number_series (number) AS (
      SELECT 1
      UNION ALL
      SELECT number + 1
      FROM number_series
      WHERE number < 500000000 -- Adjust the limit as per your desired number of rows
    )
    SELECT *
    FROM number_series;
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 01 '23 at 06:09