3

I have a data set in CSV file, which I am loading into the Pandas DataFrame to be indexed by date + time. The original data is on the New York City time (although it's not reflected in the raw data) and I need to analyze it from the perspective of the London time.

I make the index TimeZone-aware for the EST TimeZone after loading it into the DataFrame and then convert it into the Europe/London TimeZone. It seems to work fine, but not for the dates that fall into the transition period, during which the time difference between London and New York is 4 hours, rather than 5 (like right now, in late March).

The data in CSV file looks like this:

2/15/16 10:00   1
2/15/16 10:01   2
3/24/16 10:00   3
3/24/16 10:01   4
3/24/16 10:02   5
3/24/16 10:03   6
3/24/16 11:00   7
3/24/16 11:01   8

And here is the simplified code:

import datetime
import pandas as pd

df = pd.read_csv('data/dates.csv', usecols=[0, 1, 2], header=None)
df.columns = ['dt', 'tm', 'val']
df.set_index(pd.DatetimeIndex(df.dt + ' ' + df.tm), inplace=True)
del df['dt']
del df['tm']
print(df)
df.index = df.index.tz_localize('EST', ambiguous='infer').tz_convert('Europe/London')
print(df)

Output from the first print() statement is as follows:

                    val
2016-02-15 10:00:00    1
2016-02-15 10:01:00    2
2016-03-24 10:00:00    3
2016-03-24 10:01:00    4
2016-03-24 10:02:00    5
2016-03-24 10:03:00    6
2016-03-24 11:00:00    7
2016-03-24 11:01:00    8

And output from the second print() is as follows:

                         val
2016-02-15 15:00:00+00:00    1
2016-02-15 15:01:00+00:00    2
2016-03-24 15:00:00+00:00    3
2016-03-24 15:01:00+00:00    4
2016-03-24 15:02:00+00:00    5
2016-03-24 15:03:00+00:00    6
2016-03-24 16:00:00+00:00    7
2016-03-24 16:01:00+00:00    8

My issue is that while the normal +5hour adjustment is done right for 15-Feb-2016, it's not done right for 24-March-2016, as that date falls into the transition period when the NYC-LDN time difference is 4 hours, rather than 5.

I have read numerous posts on how to do the adjustment right, and was hoping that the ambiguous='infer' bit might help, but seems that I'm still doing it wrong?

Finally, if the above is difficult to achieve, I won't mind doing a manual workaround. What I ultimately need to do is to select the sub-set of the DataFrame into another DataFrame, based on hours and minutes (from the London time perspective). I'm currently doing it like this:

dfSelected = df[(df.index.time >= datetime.time(15, 1))][['val']]

I was thinking of trying something manual and ugly like this:

shortWnd = [datetime.date(2016, 3, 24), datetime.date(2016, 3, 23)] # etc....
dfSelected = df[(df.index.time >= datetime.time(15, 1) if (df.index.date in shortWnd) else (df.index.time >= datetime.time(14, 1)  ))][['val']]

Whereby I'm adjusting the selection window manually, based on each row in the source DataFrame being within the transition period or not, but I'm getting this error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Any ideas on how to do this better?!

VS_FF
  • 2,353
  • 3
  • 16
  • 34

2 Answers2

4

You need to use US/Eastern for your timezone. EST (eastern standard time) does not take into account DST.

In [47]: pd.Timestamp('20160315').tz_localize('EST')
Out[47]: Timestamp('2016-03-15 00:00:00-0500', tz='EST')

In [48]: pd.Timestamp('20160315').tz_localize('US/Eastern')
Out[48]: Timestamp('2016-03-15 00:00:00-0400', tz='US/Eastern')

In [29]: df = pd.read_csv(StringIO(data), header=None, sep='\s+', parse_dates=[[0, 1]])

In [30]: df.columns = ['date', 'value']

In [31]: df
Out[31]: 
                 date  value
0 2016-02-15 10:00:00      1
1 2016-02-15 10:01:00      2
2 2016-03-24 10:00:00      3
3 2016-03-24 10:01:00      4
4 2016-03-24 10:02:00      5
5 2016-03-24 10:03:00      6
6 2016-03-24 11:00:00      7
7 2016-03-24 11:01:00      8

In [32]: df['date_tz'] = df['date'].dt.tz_localize('US/Eastern').dt.tz_convert('Europe/London')

In [33]: df['date_tz2'] = df['date'].dt.tz_localize('EST').dt.tz_convert('Europe/London')

In [34]: df
Out[34]: 
                 date  value                   date_tz                  date_tz2
0 2016-02-15 10:00:00      1 2016-02-15 15:00:00+00:00 2016-02-15 15:00:00+00:00
1 2016-02-15 10:01:00      2 2016-02-15 15:01:00+00:00 2016-02-15 15:01:00+00:00
2 2016-03-24 10:00:00      3 2016-03-24 14:00:00+00:00 2016-03-24 15:00:00+00:00
3 2016-03-24 10:01:00      4 2016-03-24 14:01:00+00:00 2016-03-24 15:01:00+00:00
4 2016-03-24 10:02:00      5 2016-03-24 14:02:00+00:00 2016-03-24 15:02:00+00:00
5 2016-03-24 10:03:00      6 2016-03-24 14:03:00+00:00 2016-03-24 15:03:00+00:00
6 2016-03-24 11:00:00      7 2016-03-24 15:00:00+00:00 2016-03-24 16:00:00+00:00
7 2016-03-24 11:01:00      8 2016-03-24 15:01:00+00:00 2016-03-24 16:01:00+00:00
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • You are right on thanks! This seems to also take care of another issue I've been struggling with -- when time zone conversion should also adjust the day to +1 (as in US/Eastern to Australia/etc.) I was having an issue that EST wasn't adjusting the date, only the time; whereas US/Eastern seems to roll the date by 1 as expected. Thanks! – VS_FF Mar 21 '17 at 16:25
1

I can help you only a very little bit - i don't know about the timezone conversions (although I suspect if you find the right options there it will fix this)

I do know why your 'quick and dirty' manual fix at the end doesn't work though. You can't use if, else or in when comparing pandas series like that.

this:

shortWnd = [datetime.date(2016, 3, 24), datetime.date(2016, 3, 23)] # etc....
dfSelected = df[(df.index.time >= datetime.time(15, 1) if (df.index.date in shortWnd) else (df.index.time >= datetime.time(14, 1)  ))][['val']]

should be something like:

dfSelected = df[((df.index.time >= datetime.time(15, 1) & (df.index.date.isin(shortWnd))) | (df.index.time >= datetime.time(14, 1)  ))][['val']]

But I may have misunderstood your logic.

You need to rewrite your if, else and in using & | and df.isin

good luck!

Stael
  • 2,619
  • 15
  • 19
  • Thanks for the tip. I wasn't aware of 'df.isin()' rather trying to use the standard 'in' which I guess is only applicable to scalars. This however doesn't seem to work, as using 'df.index.date.isin(shortWnd)' as you are suggesting throws an error 'AttributeError: 'numpy.ndarray' object has no attribute 'isin'' and if I instead try just 'df.index.isin(shortWnd)' then it runs without errors, but the doesn't produce the desired outcome -- guess always returns False. – VS_FF Mar 21 '17 at 12:37