In general, if you know the format of a string, you don't need dateutil.parser.parse to parse it; you can use datetime.strptime with an explicit format string.
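As a minimal sketch of the strptime approach, assuming the four fields arrive as one space-separated string (the sample value and the variable names are made up for illustration):

```python
from datetime import datetime, timezone

# Hypothetical input: two-digit year, month, day and hour separated by spaces.
raw = '98 12 05 11'

# %y parses a two-digit year; Python follows the POSIX convention, mapping
# 69-99 to 1969-1999 and 00-68 to 2000-2068.
naive = datetime.strptime(raw, '%y %m %d %H')

# Attach UTC to the naive result; the clock time itself is not shifted.
aware = naive.replace(tzinfo=timezone.utc)

print(naive)  # 1998-12-05 11:00:00
print(aware)  # 1998-12-05 11:00:00+00:00
```

Note that the %y pivot (69) is fixed, so if your data needs a different cutoff year you would still have to adjust the century yourself.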
The only slightly unfortunate thing here is that you have 2-digit years, some of which are from before 2000. I'd probably handle that with something like this, treating values below 50 as 20YY and everything else as 19YY:
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
Once you've done that, you can use one of the solutions from this question (specifically this one) to convert your individual datetime columns into pandas Timestamps / datetimes.
If these are in UTC, you can then use pandas.Series.dt.tz_localize with 'UTC' to get timezone-aware datetimes.
Putting it all together:
import pandas as pd
df = pd.DataFrame(
    [[98, 12, 5, 11],
     [98, 12, 5, 10],
     [4, 12, 5, 0]],
    columns=['YY', 'MM', 'DD', 'HH'])
# Convert 2-digit years to 4-digit years
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
# Retrieve the date columns and rename them
col_renames = {'YY': 'year', 'MM': 'month', 'DD': 'day', 'HH': 'hour'}
dt_subset = df.loc[:, list(col_renames.keys())].rename(columns=col_renames)
dt_series = pd.to_datetime(dt_subset)
# Mark the naive datetimes as UTC (no clock shift is applied)
dt_series = dt_series.dt.tz_localize('UTC')
# Result:
# 0 1998-12-05 11:00:00+00:00
# 1 1998-12-05 10:00:00+00:00
# 2 2004-12-05 00:00:00+00:00
# dtype: datetime64[ns, UTC]
Also, to clarify two things about this statement:
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
- The correct documentation for python-dateutil is now https://dateutil.readthedocs.io.
- If you are using parse, there is no reason in your situation to add UTC to a dictionary and pass it to tzinfos. If you know that your datetimes are going to be naive but that they represent times in UTC, parse them as normal to get naive datetimes, then use datetime.replace(tzinfo=dateutil.tz.tzutc()) to get aware datetimes. The tzinfos dictionary is for when the timezone information is actually represented in the string.
An example of what to do when you have strings representing UTC that don't contain timezone information:
from dateutil.parser import parse
from dateutil import tz
dt = parse('1998-12-05 11:00')
dt = dt.replace(tzinfo=tz.tzutc())
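For contrast, here is a rough sketch of the case tzinfos is meant for, where the string itself carries a timezone abbreviation. The sample string and the choice to treat 'EST' as a fixed UTC-5 offset are assumptions made for illustration, not something taken from your data:

```python
from datetime import timedelta
from dateutil.parser import parse

# Hypothetical input whose abbreviation the parser cannot reliably resolve by itself.
raw = '1998-12-05 11:00 EST'

# Map the abbreviation to a UTC offset in seconds (EST assumed to be UTC-5 here).
tzinfos = {'EST': -5 * 3600}

# The result is timezone-aware, using the offset supplied for 'EST'.
dt = parse(raw, tzinfos=tzinfos)
print(dt.utcoffset() == timedelta(hours=-5))
```

The dictionary values can also be tzinfo objects instead of offsets in seconds, which is the better choice when daylight saving rules matter.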