I have a data set in CSV file, which I am loading into the Pandas DataFrame to be indexed by date + time. The original data is on the New York City time (although it's not reflected in the raw data) and I need to analyze it from the perspective of the London time.
I make the index TimeZone-aware for the EST TimeZone after loading it into the DataFrame and then convert it into the Europe/London TimeZone. It seems to work fine, but not for the dates that fall into the transition period, during which the time difference between London and New York is 4 hours, rather than 5 (like right now, in late March).
The data in CSV file looks like this:
2/15/16 10:00 1
2/15/16 10:01 2
3/24/16 10:00 3
3/24/16 10:01 4
3/24/16 10:02 5
3/24/16 10:03 6
3/24/16 11:00 7
3/24/16 11:01 8
And here is the simplified code:
import datetime
import pandas as pd
df = pd.read_csv('data/dates.csv', usecols=[0, 1, 2], header=None)
df.columns = ['dt', 'tm', 'val']
df.set_index(pd.DatetimeIndex(df.dt + ' ' + df.tm), inplace=True)
del df['dt']
del df['tm']
print(df)
df.index = df.index.tz_localize('EST', ambiguous='infer').tz_convert('Europe/London')
print(df)
Output from the first print()
statement is as follows:
val
2016-02-15 10:00:00 1
2016-02-15 10:01:00 2
2016-03-24 10:00:00 3
2016-03-24 10:01:00 4
2016-03-24 10:02:00 5
2016-03-24 10:03:00 6
2016-03-24 11:00:00 7
2016-03-24 11:01:00 8
And output from the second print()
is as follows:
val
2016-02-15 15:00:00+00:00 1
2016-02-15 15:01:00+00:00 2
2016-03-24 15:00:00+00:00 3
2016-03-24 15:01:00+00:00 4
2016-03-24 15:02:00+00:00 5
2016-03-24 15:03:00+00:00 6
2016-03-24 16:00:00+00:00 7
2016-03-24 16:01:00+00:00 8
My issue is that while the normal +5hour adjustment is done right for 15-Feb-2016, it's not done right for 24-March-2016, as that date falls into the transition period when the NYC-LDN time difference is 4 hours, rather than 5.
I have read numerous posts on how to do the adjustment right, and was hoping that the ambiguous='infer'
bit might help, but seems that I'm still doing it wrong?
Finally, if the above is difficult to achieve, I won't mind doing a manual workaround. What I ultimately need to do is to select the sub-set of the DataFrame into another DataFrame, based on hours and minutes (from the London time perspective). I'm currently doing it like this:
dfSelected = df[(df.index.time >= datetime.time(15, 1))][['val']]
I was thinking of trying something manual and ugly like this:
shortWnd = [datetime.date(2016, 3, 24), datetime.date(2016, 3, 23)] # etc....
dfSelected = df[(df.index.time >= datetime.time(15, 1) if (df.index.date in shortWnd) else (df.index.time >= datetime.time(14, 1) ))][['val']]
Whereby I'm adjusting the selection window manually, based on each row in the source DataFrame being within the transition period or not, but I'm getting this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any ideas on how to do this better?!