Simplified huge df with date column of inconsistent string formatting containing errors:
df_length = 10000
df = pd.DataFrame({
"to_ignore": np.random.randint(1, 500, df_length),
"date": np.random.choice(["11 Nov 2018", "Feb 2019", "2021-11-02", "asdf"], df_length),
})
We need to convert date
col to datetime but can't find a solution that doesn't drop data or processes within a usable time. Tried formatting successively with errors='ignore'
:
df['date'] = pd.to_datetime(df['date'], format='%b %Y', errors='ignore')
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='ignore')
But with erroneous strings ("asdf") the col seems unaffected. Trying formats successively with errors='coerce'
obviously drops data.
We tried dateparser, df['date'] = df['date'].apply(lambda x: dateparser.parse(x))
, which kinda worked except it sometimes got date of month wrong (2019-02-02 should be 2019-02-01):
to_ignore date
0 115 2019-02-02
1 285 NaT
...
This is also prohibitively slow (play with df_length
).
What's a good way to do this?