0

So the situation is this: I have a pandas dataframe with datetime index. This index, however, is in a string format, and I don't know which format is used, but I do know that it is the same one for every value. My obvious choice for parsing a datetime string without a known format would be to use a parse function from dateutil.parser. But there could be some ambiguities about some of the values, for example "2020-01-02" could be both February 1st or January 2nd, so it could be either in a %Y-%m-%d format or in a %Y-%d-%m format. But in the same dataframe I also have a value of "2020-01-30" which can only be parsed with %Y-%m-%d format. So at that point I know that all of the values in this dataframe are actually in this particular format and I want to parse all of them the same.

The problem is, dateutil function can only be applied to single values, and not to lists. As far as I understand, while dateutil.parser.parse is capable of parsing myriads of datetime formats, it cannot give you back information about what format it used, so I cannot use the parsing information from one value (that can only be parsed in 1 way) and use it for the other (that can be parsed 2 or more ways)

My second idea was to write my own function that would use as many known datetime formats as possible and try to apply them to all values in the dataframe index one by one. At this point that monster was born:

def parse_df_index(df):    
    possible_datetime_formats = [
            '%Y-%m-%d %H:%M:%S',
            '%Y-%m-%d %H:%M',
            '%d.%m.%Y %H:%M:%S',
            '%d.%m.%Y %H:%M',
            '%d/%m/%Y %H:%M:%S',
            '%d/%m/%Y %H:%M',
            '%m/%d/%Y %H:%M:%S',
            '%m/%d/%Y %H:%M'
            ...
        ]
    for date_format in possible_datetime_formats:
        try:
            index = [dt.datetime.strptime(t, date_format) for t in df.index]
        except:
            pass
    df.index = index
    return df

This approach, while dependent on my datetime format research (which didn't even scratch the surface of what dateutil function can do), has been broken by the string '2019-07-29 00:00:00+04:00', produced by none other than pandas itself. No matter how I tried, I couldnt find a way to use strptime to parse this timezone format.

At that point I'm lost. I'm sure it's not such a rare situation I'm in, so someone has probably solved it? I hope...

Anna
  • 199
  • 1
  • 10
  • This looks very similar to [this question](https://stackoverflow.com/questions/55247063/pandas-datetimes-with-different-formats-in-the-same-column). Does this solve the problem? – Rawson Apr 18 '22 at 16:04
  • It's the opposite question. This guy has multiple datetime formats in one column, and he knows what are the formats. I only have one, but it is unknown – Anna Apr 19 '22 at 09:10

1 Answers1

0

Have you tried with pd.to_datetime()? This allows the function to utilize the infer_datetime_format argument conveniently.

df['date_parsed'] = pd.to_datetime(df['raw_date'],infer_datetime_format=True)

If some specific lines have a very particular format which might cause errors, you can also use the errors argument.

Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
  • This approach also does not use a single string format for every value. Example: `df = pd.DataFrame({ 'datetime': ['01-02-2020', '15-02-2020'] }) ` `pd.to_datetime(df.datetime, infer_datetime_format=True)` gives this result: 0 2020-01-02 1 2020-02-15 It used %m-%d-%Y format for the first value and %d-%m-%Y for the second – Anna Apr 19 '22 at 08:22