So the situation is this: I have a pandas dataframe with datetime index. This index, however, is in a string format, and I don't know which format is used, but I do know that it is the same one for every value. My obvious choice for parsing a datetime string without a known format would be to use a parse
function from dateutil.parser
. But there could be some ambiguities about some of the values, for example "2020-01-02"
could be both February 1st or January 2nd, so it could be either in a %Y-%m-%d
format or in a %Y-%d-%m
format. But in the same dataframe I also have a value of "2020-01-30"
which can only be parsed with %Y-%m-%d
format. So at that point I know that all of the values in this dataframe are actually in this particular format and I want to parse all of them the same.
The problem is, dateutil
function can only be applied to single values, and not to lists. As far as I understand, while dateutil.parser.parse
is capable of parsing myriads of datetime formats, it cannot give you back information about what format it used, so I cannot use the parsing information from one value (that can only be parsed in 1 way) and use it for the other (that can be parsed 2 or more ways)
My second idea was to write my own function that would use as many known datetime formats as possible and try to apply them to all values in the dataframe index one by one. At this point that monster was born:
def parse_df_index(df):
possible_datetime_formats = [
'%Y-%m-%d %H:%M:%S',
'%Y-%m-%d %H:%M',
'%d.%m.%Y %H:%M:%S',
'%d.%m.%Y %H:%M',
'%d/%m/%Y %H:%M:%S',
'%d/%m/%Y %H:%M',
'%m/%d/%Y %H:%M:%S',
'%m/%d/%Y %H:%M'
...
]
for date_format in possible_datetime_formats:
try:
index = [dt.datetime.strptime(t, date_format) for t in df.index]
except:
pass
df.index = index
return df
This approach, while dependent on my datetime format research (which didn't even scratch the surface of what dateutil function can do), has been broken by the string '2019-07-29 00:00:00+04:00'
, produced by none other than pandas itself. No matter how I tried, I couldnt find a way to use strptime
to parse this timezone format.
At that point I'm lost. I'm sure it's not such a rare situation I'm in, so someone has probably solved it? I hope...