1

There are already a lot of questions about that topic, but I could not find replies that solve my troubles.

1. The context

I have timestamps stored in a list as strings, which look like:

print(my_timestamps)
...
3     Sun Mar 31 2019 00:00:00 GMT+0100
4     Sun Mar 31 2019 01:00:00 GMT+0100
5     Sun Mar 31 2019 03:00:00 GMT+0200
6     Sun Mar 31 2019 04:00:00 GMT+0200
...
13    Sun Oct 27 2019 01:00:00 GMT+0200
14    Sun Oct 27 2019 02:00:00 GMT+0200
15    Sun Oct 27 2019 02:00:00 GMT+0100
16    Sun Oct 27 2019 03:00:00 GMT+0100
17    Sun Oct 27 2019 04:00:00 GMT+0100
Name: date, dtype: object

You will notice I have kept 2 zones where there are DST. I use to_datetime() to store it as timestamps in a panda dataframe

df['date'] = pd.to_datetime(my_timestamps)
print(df)
...
3     2019-03-31 00:00:00-01:00
4     2019-03-31 01:00:00-01:00
5     2019-03-31 03:00:00-02:00
6     2019-03-31 04:00:00-02:00
...
13    2019-10-27 01:00:00-02:00
14    2019-10-27 02:00:00-02:00
15    2019-10-27 02:00:00-01:00
16    2019-10-27 03:00:00-01:00
17    2019-10-27 04:00:00-01:00
Name: date, dtype: object

A 1st surprising (to me) thing is that 'date' column keeps its dtype as 'object' and not 'datetime64'.

When I want to use these timestamps as indexes with

df.set_index('date', inplace = True, verify_integrity = True)

I get an error with verify_integrity check informing me there are duplicate indexes.

ValueError: Index has duplicate keys: Index([2019-10-27 02:00:00-01:00, 2019-10-27 03:00:00-01:00], dtype='object', name='date')

I obviously would like to solve that.

2. What I tried

My understanding is that the timezone data is not used, and that to use it, I should try to convert the timestamps to have its dtype to 'datetime64'.

I first added the flag utc=True in to_datetime.

test = pd.to_datetime(my_timestamps,utc=True)

But then, I simply don't understand the result:

...
3    2019-03-31 01:00:00+00:00
4    2019-03-31 02:00:00+00:00
5    2019-03-31 05:00:00+00:00
6    2019-03-31 06:00:00+00:00
...
13   2019-10-27 03:00:00+00:00
14   2019-10-27 04:00:00+00:00
15   2019-10-27 03:00:00+00:00
16   2019-10-27 04:00:00+00:00
17   2019-10-27 05:00:00+00:00

According my understanding, timezone has been interpreted in a reversed manner ?!

3     Sun Mar 31 2019 00:00:00 GMT+0100

shifted in UTC time should read as

3     2019-03-30 23:00:00+00:00

but here it is translated into:

3    2019-03-31 01:00:00+00:00

This likely explains then the error of duplicate timestamps appearing

14   2019-10-27 04:00:00+00:00
...
16   2019-10-27 04:00:00+00:00

Please, has anyone any idea how to correctly handle the timezone information so that it doesn't lead to duplicate Index?

I thank you in advance for your help.

Have a good day, Bests,

Pierrot

PS: I am fine with having the timestamps expressed in UTC, as long as the shift in hour is correctly managed.

3. Edit

It would seem fromisoformat() function, new in Python 3.7, could help. However, it accepts as input a string. I am not certain how it can be used in a "vectorized" manner to apply it on a complete dataframee column.

How to convert a timezone aware string to datetime in python without dateutil?

pierre_j
  • 895
  • 2
  • 11
  • 26

1 Answers1

0

So there does be a trouble in dateutil as indicated above. I reversed +/- sign in my original data file as indicated here:

How to replace a sub-string conditionally in a pandas dataframe column?

Bests, Pierrot

pierre_j
  • 895
  • 2
  • 11
  • 26