0

I have a dataset with mixed datetime formats and strings in the date columns. I am trying to standardize the date in the columns to a regular datetime format.

I tried combining these solutions (Clean a Messy Date Column with Mixed Formats in Pandas),(convert pandas datetime column yyyy-mm-dd to YYYYMMDD), (Working with mixed datetime formats in pandas).

I tried converting the string that has no separators first and then running the conversion function, but the I'm getting the wrong year (for example 20110912 is converted to 1970/01/01). There are so many inconsistencies, so I'm not sure which method would work best. Any suggestions would be greatly appreciated!

import numpy as np
import pandas as pd


df = pd.DataFrame({'date':[20110912.0, 20230102, '10/10/17', '4/8/14',
                           '7/28/2020', '20121001', 2023.01.02',
                           '2019-04-23 0:00:00', '2011-12-21 0:00:00', 
                           '07/28/14', '', 'NaN' ]})


s = df['date'].copy()
mask = s.apply(type) == float
s.loc[mask] = s.loc[mask].astype(int)

newdf = df.assign(date=pd.to_datetime(s.astype(str),errors='coerce').dt.date)

My code keeps throwing an error at the line to convert integers. ValueError: cannot convert float NaN to integer

  • Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), including a small example input data and the corresponding expected result. To find pathological cases in your `df`, try e.g.: `df.loc[pd.to_datetime(df['date'].astype(str), errors='coerce').isna(), 'date'].head()` and update your example to include some of those. – Pierre D Mar 19 '23 at 22:54

2 Answers2

1

pd.datetime() alone can figure it out. Add .dt.date to drop the time if needed, or .dt.strftime to specify a different date format.

import pandas as pd

df = pd.DataFrame({'date':['20110912', '20230102', '1/02/2023', '1/02/23',
                           '2023-01-02', '2023-01-02 12:59:59', '2023.01.02']})
print(df)
df.date = pd.to_datetime(df.date).dt.date
print(df)

Output:

                  date
0             20110912
1             20230102
2            1/02/2023
3              1/02/23
4           2023-01-02
5  2023-01-02 12:59:59
6           2023.01.02
         date
0  2011-09-12
1  2023-01-02
2  2023-01-02
3  2023-01-02
4  2023-01-02
5  2023-01-02
6  2023-01-02

(Note: provide test cases in your question if this doesn't work for you)

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
  • @lvlupcode [Edit](https://stackoverflow.com/posts/75776837/edit) your question with a [mcve] that outputs the problem. I have that value above in my MRE. – Mark Tolonen Mar 18 '23 at 15:56
  • That’s the same code as in the question. Please add a data frame like my example by editing the question, not a link. – Mark Tolonen Mar 18 '23 at 18:36
  • Gotcha! I added the dataframe for the MRE in the question. When I applied the string type to fix the integer issue I got an error that the value was out of range. ParserError: year 20170331 is out of range: 20170331.0 –  Mar 18 '23 at 19:17
0

Edit: After some forth and back, it seems like there really is quite a bit of noise in that data. To list a few kinds of noise that surfaced up so far:

  • mixed (but valid) date formats (e.g. '%Y-%m-%d %h:%M:%S', '%m/%D/%y', etc.);
  • dates with indication of time zone, others without;
  • int values such as 20230320;
  • float values such as 20230320.0;
  • random strings (e.g. 'NaN', '', ...);
  • missing values (NaN as float).

In this modified answer, we will:

  1. convert non-NaN floats to int;
  2. convert the whole column to type str;
  3. use pd.to_datetime(..., errors='coerce');
  4. see if any datetime is tz-aware; if so, use utc=True to bring all datetimes on a common timezone (UTC), then .dt.localize(None) to make them all tz-naive;
  5. inspect the values that are not NaN in the initial data, but NaT in the result, to review any opportunity to get more valid data through this cleaning process.

I should also mention some unexpected behavior (pandas versions tried: from 1.4.2 to 1.5.3) of pd.to_datetime in the result type as noted in this bug report (I did not expected to see float('NaN') and str('NaT') in the results when arg errors='coerce' is passed).

Reproducible example

We expand a bit on the OP's example to include some other formats and data issues:

df = pd.DataFrame({'date':[
    20110912.0, 20230102, '10/10/17', '4/8/14', '7/28/2020', '20121001',
    '2023.01.02', '2019-04-23 0:00:00', '2011-12-21 0:00:00', '07/28/14',
    'Mon Mar 20 11:03:10 UTC 2023', '', 'NaN', 'foo', float('NaN'),
]})

Cleaning process

from datetime import datetime

s = df['date'].copy()
mask = (s.apply(type) == float) & ~s.isna()
s.loc[mask] = s.loc[mask].astype(int)
s = s.astype(str)
s2 = pd.to_datetime(s, errors='coerce')

# see github.com/pandas-dev/pandas/issues/52094
is_bad = s2.apply(type) != datetime
s2.loc[is_bad] = pd.NaT

has_tz = (
    s2[~is_bad]
    .apply(datetime.tzname).astype(bool)
    .reindex(s.index, fill_value=False)
)
if has_tz.any():
    # we convert a second time, to get all datetimes
    # to a common tz: utc, then make all tz-naive
    s3 = pd.to_datetime(
        s, errors='coerce', utc=True).dt.tz_localize(None)
else:
    s3 = s2

# keep only dates; an alternatively would be to
# use .dt.normalize() to truncate down to mindnight
newdf = df.assign(date=s3.dt.date)

With this:

>>> newdf
         date
0  2011-09-12
1  2023-01-02
2  2017-10-10
3  2014-04-08
4  2020-07-28
5  2012-10-01
6  2023-01-02
7  2019-04-23
8  2011-12-21
9  2014-07-28
10 2023-03-20
11        NaT
12        NaT
13        NaT
14        NaT

And, for inspection:

>>> df.loc[is_bad, 'date'].apply(repr).value_counts()
''       1
'NaN'    1
'foo'    1
nan      1

Also, if you are curious about the intermediary values in the cleaning (and to inspect what happens to tz-aware datetimes):

>>> pd.concat([s, is_bad, s2, has_tz, s3], axis=1, keys='s is_bad s2 has_tz s3'.split())
                               s  is_bad                         s2  has_tz                  s3
0                       20110912   False        2011-09-12 00:00:00   False 2011-09-12 00:00:00
1                       20230102   False        2023-01-02 00:00:00   False 2023-01-02 00:00:00
2                       10/10/17   False        2017-10-10 00:00:00   False 2017-10-10 00:00:00
3                         4/8/14   False        2014-04-08 00:00:00   False 2014-04-08 00:00:00
4                      7/28/2020   False        2020-07-28 00:00:00   False 2020-07-28 00:00:00
5                       20121001   False        2012-10-01 00:00:00   False 2012-10-01 00:00:00
6                     2023.01.02   False        2023-01-02 00:00:00   False 2023-01-02 00:00:00
7             2019-04-23 0:00:00   False        2019-04-23 00:00:00   False 2019-04-23 00:00:00
8             2011-12-21 0:00:00   False        2011-12-21 00:00:00   False 2011-12-21 00:00:00
9                       07/28/14   False        2014-07-28 00:00:00   False 2014-07-28 00:00:00
10  Mon Mar 20 11:03:10 UTC 2023   False  2023-03-20 11:03:10+00:00    True 2023-03-20 11:03:10
11                                  True                        NaT   False                 NaT
12                           NaN    True                        NaT   False                 NaT
13                           foo    True                        NaT   False                 NaT
14                           nan    True                        NaT   False                 NaT
Pierre D
  • 24,012
  • 7
  • 60
  • 96
  • Thanks @Pierre! I've tried converting to a string first to address the int issue and I'm not getting errors that the values are out of range. ValueError: year 20110912 is out of range The above exception was the direct cause of the following exception: ParserError Traceback (most recent call last) ParserError: year 20110912 is out of range: 20110912.0 During handling of the above exception, another exception occurred: TypeError Traceback (most recent call last) TypeError: invalid string coercion to datetime –  Mar 19 '23 at 21:33
  • It still converts the integer to the wrong date, so still trying to figure out how to fix that part. –  Mar 19 '23 at 22:18
  • Thanks @Pierre - I posted the specific values from my data set/examples in my edited question. Where am I supposed to put them? I can't do that in the comments. –  Mar 20 '23 at 01:07
  • When I create a new dataframe with the values in my questions, which are from the data set, the code works, but if I try it on the field itself I get an error: ValueError: cannot convert float NaN to integer I'm not sure what is going wrong. I haven't done any other transformations on the column. –  Mar 20 '23 at 16:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/252632/discussion-between-lvlupcode-and-pierre-d). –  Mar 20 '23 at 21:06
  • Answer updated to handle most of the problems that surfaced so far from the discussion. – Pierre D Mar 21 '23 at 14:53