4

I have written this function which connects to dateutil.tz, refer to the code below:

CREATE OR REPLACE FUNCTION schema_name.fnc_name(ts timestamp without time zone, timezone character varying)
RETURNS boolean STABLE AS $$
  from datetime import datetime
  from dateutil.tz import gettz
  tstz = ts.replace(tzinfo=gettz(timezone))
  is_dst = datetime.timetuple(tstz).tm_isdst
  return is_dst
$$ LANGUAGE plpythonu;

This function is slow and I need to invoke it for more than a billion rows in an execution cycle.

I am really new to redshift and timezone stuff. Can someone please help me optimize it? Any performance improvement suggestions are appreciated, like:

  1. Move timezone details to local database somehow? (Tell me how)
  2. Don't use Python, use something else (Tell me what)
Deep Kalra
  • 1,418
  • 9
  • 27
  • By the way, a popular convention is to always store dates/times in UTC, which has the benefit of avoiding timezone problems. – John Rotenstein Apr 11 '16 at 09:48
  • Are you able to provide more information as to WHY you need this? Is there any scope for changing the way data is stored in the original table to avoid having to do this in future? – John Rotenstein Apr 11 '16 at 09:50
  • My whole idea around this is timezone normalization. I have to do this. I have timezone in UTC only. – Deep Kalra Apr 11 '16 at 10:24

3 Answers3

2

Use IMMUTABLE rather than STABLE since the return value will always be the same, given the input value. From the documentation:

STABLE: Given the same arguments, the function is guaranteed to return the same results for all rows processed within a single statement. The function can return different results when called in different statements. This category allows the optimizer to optimize multiple calls of the function within a single statement to a single call for the statement.

IMMUTABLE: Given the same arguments, the function always returns the same result, forever. When a query calls an IMMUTABLE function with constant arguments, the optimizer pre-evaluates the function.

Also, to enable Redshift to cache the result, pass in a DATE rather than a TIMESTAMP. This will reduce the number of input values used, so that they are more likely to use a previously-calculated (and cached) value.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
  • So in short, I should save functions result for each timezone in a table and do a join in my query? – Deep Kalra Apr 08 '16 at 15:54
  • No. Redshift will do that for you. I'm suggesting you change use IMMUTABLE rather than STABLE, and pass in a DATE rather than a TIMESTAMP. You will probably need to tweak the code to convert the DATE to the desired format. End result: Less input values, which helps Redshift automatically cache results, which should speed things up a bit. – John Rotenstein Apr 08 '16 at 21:12
  • That is wrong. Now for an example this year DST began at **2:00 AM** on Sunday, March 13, and by writing the function this way (passing date without time) I am missing the those **2 hours**. Or I am not getting what you said. – Deep Kalra Apr 10 '16 at 21:05
  • Ah! Yes, if it begins at a non-midnight border, it will not function as you desire. – John Rotenstein Apr 11 '16 at 09:49
  • I will try trimming out the minutes part and just keeping the timestamp till hour, hope that fastens the query (by your logic). Thanks again. – Deep Kalra Apr 11 '16 at 10:26
2

Note: this answer is about PostgreSQL. Most of the solution should be able to be applied to redshift as well, because it's based on an older version of PostgreSQL. However, you may need to search for alternatives for parts of this solution, as I cannot test this on redshift (f.ex. use the CONVERT_TIMEZONE(tz, ts) function instead of the ts AT TIME ZONE tz expression).

First, you need to understand that there are multiple "types" of time zones. F.ex. Europe/London is a time-zone name, and the database have information about its daylight-saving rules. However a time-zone offset (f.ex. UTC, UTC+2 or any time interval) is static, and never be considered to be a daylight-saving time (neither in python). There are also time-zone abbreviations, which are just aliases of time-zone offsets, but they have an alternate name for their DST variants (f.ex. CET in daylight-saving time is CEST) so they are never (or always) considered to be daylight-saving times (note that PostgreSQL accepts (and adjusts) bogus date-time inputs, like 2016-01-12 10:00 CEST which is really 2016-01-12 09:00 CET). Also, there are the POSIX-style time-zones, like EST5EDT which can have their own daylight-saving rules.

For a pure SQL detection, you need to query the pg_timezone_abbrevs and pg_timezone_names system views:

create or replace function tstz_isdst(ts timestamp without time zone, tz text)
  returns boolean
  immutable
  language sql
as $func$
  with tz_info as (
      select utc_offset, true fix_dst, is_dst
      from   pg_timezone_abbrevs
      where  lower(abbrev) = lower(tz)
    union all
      select utc_offset, false, is_dst
      from   pg_timezone_names
      where  lower(name) = lower(tz)
    union all
      select -coalesce(substring(tz from '([\+\-]?\d+(:\d+){1,2}(.\d+)?)')::interval,
                       substring(tz from '[\+\-]?\d+')::integer * interval '1 hour'),
             false, false
  )
  select case
           when fix_dst then is_dst
           when ts = (ts at time zone tz at time zone 'UTC' + utc_offset) then is_dst
           else not is_dst
         end
  from   tz_info
  limit  1
$func$;

select tstz_isdst('2016-01-12 10:00', 'GMT'),
       tstz_isdst('2016-04-12 10:00', 'BST'),
       tstz_isdst('2016-03-27 01:30', 'GMT0BST'), -- not exists
       tstz_isdst('2016-10-30 01:30', 'Europe/London'); -- ambiguous

Note that this function will return false for not existing or ambiguous date-time + time-zone combinations (when daylight-saving change is currently happening).

But this function may still not, what you need, because in PostgreSQL the pg_timezone_names view is calculated very slowly (300-600 ms in my tests), so it may not be optimal to query it for every row in a table. But you could use joins instead:

select t.ts, t.tz, case
         when tz_abbr.is_dst is not null
           then tz_abbr.is_dst
         when tz_name.utc_offset is not null
           then case
             when t.ts = (t.ts at time zone t.tz at time zone 'UTC' + tz_name.utc_offset)
               then tz_name.is_dst
             else not tz_name.is_dst
           end
         else t.ts <> (t.ts at time zone t.tz at time zone 'UTC' -
           coalesce(substring(t.tz from '([\+\-]?\d+(:\d+){1,2}(.\d+)?)')::interval,
                    substring(t.tz from '[\+\-]?\d+')::integer * interval '1 hour'))
       end is_dst
from   (values(timestamp '2016-01-12 10:00', 'GMT'),
              (timestamp '2016-04-12 10:00', 'BST'),
              (timestamp '2016-03-27 01:30', 'GMT0BST'),
              (timestamp '2016-10-30 01:30', 'Europe/London')) t(ts, tz)
left join pg_timezone_abbrevs tz_abbr on lower(tz_abbr.abbrev) = lower(t.tz)
left join pg_timezone_names   tz_name on lower(tz_name.name)   = lower(t.tz);
Community
  • 1
  • 1
pozs
  • 34,608
  • 5
  • 57
  • 63
1

Take a look at the CONVERT_TIMEZONE function

denismo
  • 760
  • 4
  • 8
  • Thanks for the answer. I know the function, how will that help me know if the daylight savings is active for a particular date or not? – Deep Kalra Apr 11 '16 at 08:02
  • CONVERT_TIMEZONE will respect daylight saving times. What are you going to do with is_dst? What is the more general problem you are trying to solve? – denismo Apr 11 '16 at 12:57