
I need to quickly turn an ISO 8601 datetime string (with no timezone in the string, but known to be in the US/Pacific timezone) into a numpy datetime64 object.

If my machine were in US/Pacific time, I could simply run numpy.datetime64(s). However, this assumes that strings without timezones are in the local timezone. Furthermore, I can't easily specify the US/Pacific timezone as a fixed ISO 8601 offset, because it is sometimes -0800 and sometimes -0700 depending on daylight saving time.

So far, the fastest solution I have is numpy.datetime64(pandas.Timestamp(s).tz_localize(tz='US/Pacific', ambiguous=True)). This takes 70 µs on my machine. It would be good to get this at least an order of magnitude faster (numpy.datetime64(s) in local time takes 4 µs but is incorrect as described above). Is this possible?
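For reference, my current approach looks like this (the sample string is made up; the conversion goes through UTC wall time, since plain np.datetime64 of a tz-aware Timestamp is not reliable across NumPy versions, which since 1.11 treat datetime64 as timezone-naive):

```python
import numpy as np
import pandas as pd

s = "2015-03-26T12:34:56"  # naive string, known to be US/Pacific

# Localize, then convert to the UTC wall time as a naive timestamp
# before handing it to NumPy's timezone-naive datetime64:
ts = pd.Timestamp(s).tz_localize("US/Pacific", ambiguous=True)
result = np.datetime64(ts.tz_convert("UTC").tz_localize(None))
```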

Ben Kuhn
    If performance is a concern, you must be doing this millions of times, right? Is the offset the same each time? If so, perhaps just use `numpy.datetime64(s)` on all of them, and then use numpy arithmetic to shift all of them by the same offset amount in one fell blow. – unutbu Mar 26 '15 at 00:56
  • @unutbu: Unfortunately, the offset is not necessarily the same: some of the datetimes are during DST periods and some of them are not. Also, it seems like even calculating the DST adjustment per-datetime myself would be error-prone? – Ben Kuhn Mar 26 '15 at 01:44
    I'm not sure. I haven't found a faster way. But in any case, beware that `np.datetime64(pd.Timestamp(s, tz='US/Pacific'))` may not be giving you the desired result. Consider the ISO 8601 datetime string `'2000-04-02T03:00:00-07:00'`. Without the timezone you'd have `s = '2000-04-02T03:00:00'`. But `np.datetime64(pd.Timestamp(s, tz='US/Pacific'))` returns `numpy.datetime64('2000-04-02T04:00:00.000000-0700')`, so this is not round-tripping properly. – unutbu Mar 26 '15 at 02:07
  • what is your OS? Python version? `numpy` version? `pandas` version? Can you upgrade? – jfs Mar 26 '15 at 07:58
  • @unutbu: Good catch on the incorrectness! The right call is `np.datetime64(pd.Timestamp(s).tz_localize(tz=tz, ambiguous=True))`, which agrees with the faster method in your answer but is almost 2x slower than the original (so 200x slower than your result). – Ben Kuhn Mar 26 '15 at 18:36

1 Answer


First note that without the offset, some local times (and therefore their datetime strings) are ambiguous. For example, the ISO 8601 datetime strings

2000-10-29T01:00:00-07:00
2000-10-29T01:00:00-08:00

both map to the same string 2000-10-29T01:00:00 when the offset is removed.

So it may not always be possible to reconstitute a unique timezone-aware datetime from a datetime string without an offset.

However, we can make a choice in these ambiguous situations and accept that not every ambiguous date will be converted correctly.
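The ambiguity is easy to see with pandas itself: during the 2000 fall-back transition in US/Pacific, 01:00 occurs twice, and the `ambiguous` flag picks which instant the naive string means (a small illustration, not part of the original answer):

```python
import pandas as pd

# 2000-10-29 01:00 happens twice in US/Pacific: once in PDT, once in PST.
s = "2000-10-29T01:00:00"
early = pd.Timestamp(s).tz_localize("US/Pacific", ambiguous=True)   # DST, -07:00
late = pd.Timestamp(s).tz_localize("US/Pacific", ambiguous=False)   # standard, -08:00
print(early)  # 2000-10-29 01:00:00-07:00
print(late)   # 2000-10-29 01:00:00-08:00
```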


If you are using Unix, you can use time.tzset to change the process's local timezone:

import os
import time

tz = 'US/Pacific'
os.environ['TZ'] = tz
time.tzset()

You could then convert the datetime strings to NumPy datetime64 values using

def using_tzset(date_strings, tz):
    os.environ['TZ'] = tz
    time.tzset()
    return np.array(date_strings, dtype='datetime64[ns]')

Note however that using_tzset does not always produce the same value as the method you proposed:

import os
import time
import numpy as np
import pandas as pd

tz = 'US/Pacific'
N = 10**5
dates = pd.date_range('2000-1-1', periods=N, freq='H', tz=tz)
date_strings_tz = [d.isoformat() for d in dates]
date_strings = [d.rsplit('-', 1)[0] for d in date_strings_tz]

def orig(date_strings, tz):
    return [np.datetime64(pd.Timestamp(s, tz=tz)) for s in date_strings]

def using_tzset(date_strings, tz):
    os.environ['TZ'] = tz
    time.tzset()
    return np.array(date_strings, dtype='datetime64[ns]')

npdates = dates.asi8.view('datetime64[ns]')
x = np.array(orig(date_strings, tz))
y = using_tzset(date_strings, tz)
df = pd.DataFrame({'dates': npdates, 'str': date_strings_tz, 'orig': x, 'using_tzset': y})

This indicates that the original method, orig, fails to recover the original date 172 times:

print((df['dates'] != df['orig']).sum())
172

while using_tzset fails 11 times:

print((df['dates'] != df['using_tzset']).sum())
11  

Note, however, that the 11 failures of using_tzset are all due to local datetimes made ambiguous by DST.

This shows some of the discrepancies:

mask = df['dates'] != df['using_tzset']
idx = np.where(mask.shift(1) | mask)[0]
print(df[['dates', 'str', 'using_tzset']].iloc[idx].head(6))

#                     dates                        str         using_tzset
# 7248  2000-10-29 08:00:00  2000-10-29T01:00:00-07:00 2000-10-29 08:00:00
# 7249  2000-10-29 09:00:00  2000-10-29T01:00:00-08:00 2000-10-29 08:00:00
# 15984 2001-10-28 08:00:00  2001-10-28T01:00:00-07:00 2001-10-28 08:00:00
# 15985 2001-10-28 09:00:00  2001-10-28T01:00:00-08:00 2001-10-28 08:00:00
# 24720 2002-10-27 08:00:00  2002-10-27T01:00:00-07:00 2002-10-27 08:00:00
# 24721 2002-10-27 09:00:00  2002-10-27T01:00:00-08:00 2002-10-27 08:00:00

As you can see, the discrepancies occur exactly where the date strings in the str column become ambiguous once the offset is removed.

So using_tzset appears to produce the correct result up to ambiguous datetimes.
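As an aside (my addition, not part of the original answer): if the naive strings are in chronological order, pandas can localize the whole sequence at once and use the ordering to disambiguate the repeated DST hour, along the lines of the pytz approach mentioned in the comments:

```python
import numpy as np
import pandas as pd

strings = ['2000-10-29T00:00:00',
           '2000-10-29T01:00:00',  # first occurrence (PDT, -07:00)
           '2000-10-29T01:00:00',  # second occurrence (PST, -08:00)
           '2000-10-29T02:00:00']

# ambiguous='infer' uses the chronological order to resolve the repeat:
idx = pd.to_datetime(strings).tz_localize('US/Pacific', ambiguous='infer')

# UTC instants as timezone-naive datetime64[ns]:
utc = idx.tz_convert('UTC').tz_localize(None).values
```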


Here is a timeit benchmark comparing orig and using_tzset:

In [95]: %timeit orig(date_strings, tz)
1 loops, best of 3: 5.43 s per loop

In [96]: %timeit using_tzset(date_strings, tz)
10 loops, best of 3: 41.7 ms per loop

So using_tzset is over 100x faster than orig when N = 10**5.
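If you go this route, it may be worth restoring the process's original TZ afterwards so other code in the process is unaffected. A minimal sketch (the parse_in_tz name and the save/restore logic are my addition; also note that NumPy 1.11+ made datetime64 timezone-naive, so the tzset trick only affects the older NumPy this answer targets):

```python
import os
import time
import numpy as np

def parse_in_tz(date_strings, tz):
    """Parse naive ISO 8601 strings as datetime64[ns] with TZ temporarily
    set to `tz`, then restore the original TZ (Unix-only: time.tzset)."""
    old = os.environ.get('TZ')
    os.environ['TZ'] = tz
    time.tzset()
    try:
        return np.array(date_strings, dtype='datetime64[ns]')
    finally:
        if old is None:
            os.environ.pop('TZ', None)
        else:
            os.environ['TZ'] = old
        time.tzset()
```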

unutbu
  • Awesome! I didn't realize it was possible to set the timezone and have Numpy pick it up. You said "the machine's local timezone", but this is actually the *process's* timezone, right? It won't interfere with other stuff on the machine? – Ben Kuhn Mar 26 '15 at 18:34
  • Yes; thanks for the correction. `tzset` changes the process's notion of local timezone. – unutbu Mar 26 '15 at 18:49
  • Awesome. Sounds like this is the way we'll go until there's a way to give numpy a locale to parse in. – Ben Kuhn Mar 26 '15 at 18:52
  • @BenKuhn: if the timestamps are ordered then `pytz` allows to [handle ambiguous times too](http://stackoverflow.com/a/26221183/4279) – jfs Mar 27 '15 at 09:29
  • @J.F.Sebastian: Did you time your script? I tried to adapt it, but unfortunately, it was basically the same as the Pandas method I outlined above. – Ben Kuhn Mar 27 '15 at 15:02
  • @BenKuhn: the point of the example is to get the correct result. You can use it to test your faster methods. – jfs Mar 27 '15 at 19:55