There are already a lot of questions about that topic, but I could not find replies that solve my troubles.
1. The context
I have timestamps stored in a list as strings, which look like:
print(my_timestamps)
...
3 Sun Mar 31 2019 00:00:00 GMT+0100
4 Sun Mar 31 2019 01:00:00 GMT+0100
5 Sun Mar 31 2019 03:00:00 GMT+0200
6 Sun Mar 31 2019 04:00:00 GMT+0200
...
13 Sun Oct 27 2019 01:00:00 GMT+0200
14 Sun Oct 27 2019 02:00:00 GMT+0200
15 Sun Oct 27 2019 02:00:00 GMT+0100
16 Sun Oct 27 2019 03:00:00 GMT+0100
17 Sun Oct 27 2019 04:00:00 GMT+0100
Name: date, dtype: object
You will notice I have kept 2 zones where there are DST. I use to_datetime() to store it as timestamps in a panda dataframe
df['date'] = pd.to_datetime(my_timestamps)
print(df)
...
3 2019-03-31 00:00:00-01:00
4 2019-03-31 01:00:00-01:00
5 2019-03-31 03:00:00-02:00
6 2019-03-31 04:00:00-02:00
...
13 2019-10-27 01:00:00-02:00
14 2019-10-27 02:00:00-02:00
15 2019-10-27 02:00:00-01:00
16 2019-10-27 03:00:00-01:00
17 2019-10-27 04:00:00-01:00
Name: date, dtype: object
A 1st surprising (to me) thing is that 'date' column keeps its dtype as 'object' and not 'datetime64'.
When I want to use these timestamps as indexes with
df.set_index('date', inplace = True, verify_integrity = True)
I get an error with verify_integrity check informing me there are duplicate indexes.
ValueError: Index has duplicate keys: Index([2019-10-27 02:00:00-01:00, 2019-10-27 03:00:00-01:00], dtype='object', name='date')
I obviously would like to solve that.
2. What I tried
My understanding is that the timezone data is not used, and that to use it, I should try to convert the timestamps to have its dtype to 'datetime64'.
I first added the flag utc=True in to_datetime.
test = pd.to_datetime(my_timestamps,utc=True)
But then, I simply don't understand the result:
...
3 2019-03-31 01:00:00+00:00
4 2019-03-31 02:00:00+00:00
5 2019-03-31 05:00:00+00:00
6 2019-03-31 06:00:00+00:00
...
13 2019-10-27 03:00:00+00:00
14 2019-10-27 04:00:00+00:00
15 2019-10-27 03:00:00+00:00
16 2019-10-27 04:00:00+00:00
17 2019-10-27 05:00:00+00:00
According my understanding, timezone has been interpreted in a reversed manner ?!
3 Sun Mar 31 2019 00:00:00 GMT+0100
shifted in UTC time should read as
3 2019-03-30 23:00:00+00:00
but here it is translated into:
3 2019-03-31 01:00:00+00:00
This likely explains then the error of duplicate timestamps appearing
14 2019-10-27 04:00:00+00:00
...
16 2019-10-27 04:00:00+00:00
Please, has anyone any idea how to correctly handle the timezone information so that it doesn't lead to duplicate Index?
I thank you in advance for your help.
Have a good day, Bests,
Pierrot
PS: I am fine with having the timestamps expressed in UTC, as long as the shift in hour is correctly managed.
3. Edit
It would seem fromisoformat() function, new in Python 3.7, could help. However, it accepts as input a string. I am not certain how it can be used in a "vectorized" manner to apply it on a complete dataframee column.
How to convert a timezone aware string to datetime in python without dateutil?