A script to correct corrupted date values

Question

my dataframe contains numerous incorrect datetime values that have been fat-fingered in by the people who entered those data. The errors are mostly 2019-11-12 was entered at 0019-12-12 and 2018 entered as 0018. There are so many of them, so I want came up with a script to correct them en mass. I used the following code:

df['A'].loc[df.A.dt.year<100]=df.A.dt.year+2000

Basically, I want tell python to detect any of the years less than 100 then add 2000 to the year. However, I am getting error :"Out of bounds nanosecond timestamp: 19-11-19 00:00:00" Is there any solution to my problem? Thanks

score 1 · Accepted Answer · answered Apr 12 '20 at 19:58

This is because of the limitations of timestamps : see this post about out of bounds nanosecond timestamp.

Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:

import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})

# I look for every date starting with zero and another number and replace by 20
r = re.compile(r"^0[0-9]{1}")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df

Here is the result

          A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08

You need to make sure that you can only have dates in 20XX (where X is any number) and not dates in 19XX or other before applying this.

I am getting an error on df["A"] = df["A"].apply(lambda x: r.sub('20', x)). TypeError: expected string or bytes-like object. I guess it's because column "A" is datetime to begin with? — Billy, Apr 12 '20 at 23:11
Try transforming it to a string before applying the `sub` with `df["A"] = df["A"].astype(str)`. — Raphaele Adjerad, Apr 13 '20 at 05:16

score 0 · Answer 2 · answered Apr 12 '20 at 20:47

An option would be to export to csv. Then make the changes and import again.

    df.to_csv('path/csvfile.csv')

    text = open("path/csvfile.csv", "r")
    text = ''.join([i for i in text]) \
        .replace("0019-", "2019-")
    x = open("path/newcsv.csv","w")
    x.writelines(text)
    x.close()

    df_new = pd.read_csv("path/newcsv.csv")

A script to correct corrupted date values

2 Answers2