0

my dataframe contains numerous incorrect datetime values that have been fat-fingered in by the people who entered those data. The errors are mostly 2019-11-12 was entered at 0019-12-12 and 2018 entered as 0018. There are so many of them, so I want came up with a script to correct them en mass. I used the following code:

df['A'].loc[df.A.dt.year<100]=df.A.dt.year+2000

Basically, I want tell python to detect any of the years less than 100 then add 2000 to the year. However, I am getting error :"Out of bounds nanosecond timestamp: 19-11-19 00:00:00" Is there any solution to my problem? Thanks

Raphaele Adjerad
  • 1,117
  • 6
  • 12
Billy
  • 33
  • 4

2 Answers2

1

This is because of the limitations of timestamps : see this post about out of bounds nanosecond timestamp.

Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:

import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})

# I look for every date starting with zero and another number and replace by 20
r = re.compile(r"^0[0-9]{1}")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df

Here is the result

          A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08

You need to make sure that you can only have dates in 20XX (where X is any number) and not dates in 19XX or other before applying this.

Raphaele Adjerad
  • 1,117
  • 6
  • 12
  • I am getting an error on df["A"] = df["A"].apply(lambda x: r.sub('20', x)). TypeError: expected string or bytes-like object. I guess it's because column "A" is datetime to begin with? – Billy Apr 12 '20 at 23:11
  • Try transforming it to a string before applying the `sub` with `df["A"] = df["A"].astype(str)`. – Raphaele Adjerad Apr 13 '20 at 05:16
0

An option would be to export to csv. Then make the changes and import again.

    df.to_csv('path/csvfile.csv')

    text = open("path/csvfile.csv", "r")
    text = ''.join([i for i in text]) \
        .replace("0019-", "2019-")
    x = open("path/newcsv.csv","w")
    x.writelines(text)
    x.close()

    df_new = pd.read_csv("path/newcsv.csv")

greylock
  • 89
  • 4