Edit: After some back and forth, it seems there really is quite a bit of noise in that data (see the quick profiling sketch below). To list a few kinds of noise that have surfaced so far:

- mixed (but valid) date formats (e.g. `'%Y-%m-%d %H:%M:%S'`, `'%m/%d/%y'`, etc.);
- dates with an indication of time zone, others without;
- `int` values such as `20230320`;
- `float` values such as `20230320.0`;
- random strings (e.g. `'NaN'`, `''`, ...);
- missing values (`NaN` as `float`).
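Before cleaning anything, it helps to profile what is actually in the column. A minimal sketch, assuming the `df` from the reproducible example further down:

```python
import pandas as pd

# which Python types are present in the raw column, and how often?
print(df['date'].apply(type).value_counts())

# a sample of the raw values, repr'd so that '' and 'NaN' are visible
print(df['date'].apply(repr).head(15))
```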
In this modified answer, we will:

- convert non-NaN floats to `int` (the short sketch after this list shows why);
- convert the whole column to type `str`;
- use `pd.to_datetime(..., errors='coerce')`;
- check whether any datetime is tz-aware; if so, use `utc=True` to bring all datetimes to a common timezone (UTC), then `.dt.tz_localize(None)` to make them all tz-naive;
- inspect the values that are not `NaN` in the initial data but `NaT` in the result, to review any opportunity to get more valid data through this cleaning process.
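To see why the float-to-int conversion matters: `str()` of a float keeps the trailing `.0`, which defeats the `%Y%m%d` parsing. A minimal sketch (exact behavior may vary slightly across pandas versions):

```python
import pandas as pd

# '20110912.0' does not parse as a date, while '20110912' does
print(pd.to_datetime(str(20110912.0), errors='coerce'))  # NaT
print(pd.to_datetime(str(20110912), errors='coerce'))    # 2011-09-12 00:00:00
```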
I should also mention some unexpected behavior of `pd.to_datetime` (pandas versions tried: from 1.4.2 to 1.5.3) in the result type, as noted in this bug report (I did not expect to see `float('NaN')` and `str('NaT')` in the results when `errors='coerce'` is passed).
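A minimal sketch of that behavior, assuming one of the affected 1.x versions (newer pandas versions handle mixed-timezone input differently):

```python
import pandas as pd

# mixing tz-aware and tz-naive strings forces an object-dtype result;
# on the affected versions, coerced failures can surface as
# float('nan') or the string 'NaT' instead of pd.NaT
out = pd.to_datetime(
    pd.Series(['Mon Mar 20 11:03:10 UTC 2023', '2023-01-02', 'foo']),
    errors='coerce',
)
print(out.apply(type).value_counts())
```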
**Reproducible example**
We expand a bit on the OP's example to include some other formats and data issues:
```python
import pandas as pd

df = pd.DataFrame({'date': [
    20110912.0, 20230102, '10/10/17', '4/8/14', '7/28/2020', '20121001',
    '2023.01.02', '2019-04-23 0:00:00', '2011-12-21 0:00:00', '07/28/14',
    'Mon Mar 20 11:03:10 UTC 2023', '', 'NaN', 'foo', float('NaN'),
]})
```
**Cleaning process**
```python
from datetime import datetime

s = df['date'].copy()

# convert non-NaN floats to int, so that e.g. 20110912.0 becomes
# the parseable string '20110912' rather than '20110912.0'
mask = (s.apply(type) == float) & ~s.isna()
s.loc[mask] = s.loc[mask].astype(int)
s = s.astype(str)

s2 = pd.to_datetime(s, errors='coerce')

# coerced failures may come back as float('nan') or str('NaT');
# see github.com/pandas-dev/pandas/issues/52094
is_bad = s2.apply(type) != datetime
s2.loc[is_bad] = pd.NaT

# True for rows whose parsed datetime carries a timezone
has_tz = (
    s2[~is_bad]
    .apply(datetime.tzname).astype(bool)
    .reindex(s.index, fill_value=False)
)
if has_tz.any():
    # we convert a second time, to get all datetimes
    # to a common tz (UTC), then make them all tz-naive
    s3 = pd.to_datetime(
        s, errors='coerce', utc=True).dt.tz_localize(None)
else:
    s3 = s2

# keep only dates; an alternative would be to use
# .dt.normalize() to truncate down to midnight
newdf = df.assign(date=s3.dt.date)
```
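As a side note on that last step, here is a sketch of the `.dt.normalize()` alternative mentioned in the comment (`newdf_dt` is just an illustrative name):

```python
# keep a datetime64 column truncated to midnight instead of converting
# to Python date objects (this retains the vectorized .dt accessors)
newdf_dt = df.assign(date=s3.dt.normalize())
```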
With this:
```
>>> newdf
          date
0   2011-09-12
1   2023-01-02
2   2017-10-10
3   2014-04-08
4   2020-07-28
5   2012-10-01
6   2023-01-02
7   2019-04-23
8   2011-12-21
9   2014-07-28
10  2023-03-20
11         NaT
12         NaT
13         NaT
14         NaT
```
And, for inspection:
```
>>> df.loc[is_bad, 'date'].apply(repr).value_counts()
''       1
'NaN'    1
'foo'    1
nan      1
```
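Here the four remaining rows are genuinely unparseable, but if the inspection had revealed a recoverable pattern, one could parse those rows separately and fill them back in. A hypothetical sketch (the `'%d.%m.%Y'` format is an invented example, not present in the data above):

```python
# suppose inspection had revealed values like '12.09.2011': parse the
# still-bad rows with that extra format and merge the recovered values
recovered = pd.to_datetime(s[is_bad], format='%d.%m.%Y', errors='coerce')
s3 = s3.fillna(recovered)
```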
Also, if you are curious about the intermediate values in the cleaning process (and want to inspect what happens to tz-aware datetimes):
```
>>> pd.concat([s, is_bad, s2, has_tz, s3], axis=1, keys='s is_bad s2 has_tz s3'.split())
                               s  is_bad                         s2  has_tz                   s3
0                       20110912   False        2011-09-12 00:00:00   False  2011-09-12 00:00:00
1                       20230102   False        2023-01-02 00:00:00   False  2023-01-02 00:00:00
2                       10/10/17   False        2017-10-10 00:00:00   False  2017-10-10 00:00:00
3                         4/8/14   False        2014-04-08 00:00:00   False  2014-04-08 00:00:00
4                      7/28/2020   False        2020-07-28 00:00:00   False  2020-07-28 00:00:00
5                       20121001   False        2012-10-01 00:00:00   False  2012-10-01 00:00:00
6                     2023.01.02   False        2023-01-02 00:00:00   False  2023-01-02 00:00:00
7             2019-04-23 0:00:00   False        2019-04-23 00:00:00   False  2019-04-23 00:00:00
8             2011-12-21 0:00:00   False        2011-12-21 00:00:00   False  2011-12-21 00:00:00
9                       07/28/14   False        2014-07-28 00:00:00   False  2014-07-28 00:00:00
10  Mon Mar 20 11:03:10 UTC 2023   False  2023-03-20 11:03:10+00:00    True  2023-03-20 11:03:10
11                                  True                        NaT   False                  NaT
12                           NaN    True                        NaT   False                  NaT
13                           foo    True                        NaT   False                  NaT
14                           nan    True                        NaT   False                  NaT
```