
Below is a sample of my df

date                   value

0006-03-01 00:00:00    1   
0006-03-15 00:00:00    2   
0006-05-15 00:00:00    1   
0006-07-01 00:00:00    3   
0006-11-01 00:00:00    1   
2009-05-20 00:00:00    2   
2009-05-25 00:00:00    8   
2020-06-24 00:00:00    1   
2020-06-30 00:00:00    2   
2020-07-01 00:00:00    13  
2020-07-15 00:00:00    2   
2020-08-01 00:00:00    4   
2020-10-01 00:00:00    2   
2020-11-01 00:00:00    4    
2023-04-01 00:00:00    1   
2218-11-12 10:00:27    1   
4000-01-01 00:00:00    6 
5492-04-15 00:00:00    1    
5496-03-15 00:00:00    1    
5589-12-01 00:00:00    1    
7199-05-15 00:00:00    1    
9186-12-30 00:00:00    1  

As you can see, the data contains some erroneous dates.

Questions:

  • How can we convert this column to the format dd.mm.yyyy?
  • How can we replace dates where the year is greater than 2022 with 01.01.2100?
  • How can we remove all rows where the year is less than 2005?

The final output should look like this:

date          value

20.05.2009    2
25.05.2009    8
24.06.2020    1
30.06.2020    2
01.07.2020    13
15.07.2020    2
01.08.2020    4
01.10.2020    2
01.11.2020    4
01.01.2100    1
01.01.2100    1
01.01.2100    6
01.01.2100    1
01.01.2100    1
01.01.2100    1
01.01.2100    1
01.01.2100    1

I tried to convert the column using to_datetime, but it failed with:

df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

Out of bounds nanosecond timestamp: 5-03-01 00:00:00

Thanks to anyone helping!

A2N15
  • The out of bounds error is because Pandas represents timestamps in nanosecond resolution: https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a – L.Clarkson May 16 '20 at 22:46
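For reference, the nanosecond limitation mentioned in the comment can be checked directly; pandas stores datetimes as 64-bit nanosecond counts, which only covers roughly the years 1677 to 2262:

```python
import pandas as pd

# the representable range of a nanosecond-resolution Timestamp
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```

Any date outside this window (like 0006-03-01 or 9186-12-30 above) raises OutOfBoundsDatetime.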

3 Answers


You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small values like '0006', calling pd.to_datetime with errors='coerce' will do the trick: it leaves NaT for the invalid dates, which you can then drop with dropna(). Example:

import pandas as pd

df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
                            '0006-03-15 00:00:00',
                            '0006-05-15 00:00:00',
                            '0006-07-01 00:00:00',
                            '0006-11-01 00:00:00',
                            'nan',
                            '2009-05-25 00:00:00',
                            '2020-06-24 00:00:00',
                            '2020-06-30 00:00:00',
                            '2020-07-01 00:00:00',
                            '2020-07-15 00:00:00',
                            '2020-08-01 00:00:00',
                            '2020-10-01 00:00:00',
                            '2020-11-01 00:00:00',
                            '2023-04-01 00:00:00',
                            '2218-11-12 10:00:27',
                            '4000-01-01 00:00:00',
                            'NaN',
                            '5496-03-15 00:00:00',
                            '5589-12-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '9186-12-30 00:00:00']})

# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]

# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'

# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0           NaT
# 1           NaT
# 2           NaT
# 3           NaT
# 4           NaT
# 6    2009-05-25
# 7    2020-06-24
# ...

df = df.dropna()
# df
#          date
# 6  2009-05-25
# 7  2020-06-24
# 8  2020-06-30
# 9  2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
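To get the dd.mm.yyyy string format asked for, dt.strftime can be applied after the conversion (a sketch; note this turns the column back into plain strings, so do it as the last step):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2009-05-25', '2100-01-01']))
# format datetimes as 'dd.mm.yyyy' strings (dtype becomes object)
print(s.dt.strftime('%d.%m.%Y').tolist())  # ['25.05.2009', '01.01.2100']
```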
FObersteiner
  • Thanks @MrFuppes! It's doing the job. Besides, I was wondering how I can adjust your code if the DF contains NaN (without removing them), because when applying your code it gives me the following error: invalid literal for int() with base 10: 'nan' – A2N15 May 18 '20 at 08:03
  • @Annis15 You mean the 'date' column initially contains strings 'nan'? – FObersteiner May 18 '20 at 08:30
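One way to handle that case (a sketch, not from the answer itself): pd.to_numeric with errors='coerce' turns the 'nan' year strings into NaN instead of raising the int() error, so the rows can be kept:

```python
import pandas as pd

df = pd.DataFrame({'date': ['nan',
                            '2023-04-01 00:00:00',
                            '2009-05-25 00:00:00']})

# extract the year part; 'nan' becomes NaN instead of raising on astype(int)
years = pd.to_numeric(df['date'].str.split('-').str[0], errors='coerce')
df.loc[years > 2022, 'date'] = '2100-01-01 00:00:00'
print(df['date'].tolist())
# ['nan', '2100-01-01 00:00:00', '2009-05-25 00:00:00']
```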

The out of bounds error is thrown due to pandas' nanosecond timestamp limitations (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code removes values that would cause the error before creating the DataFrame.

import datetime as dt

import pandas as pd

data = [[dt.datetime(year=2022, month=3, day=1), 1],
        [dt.datetime(year=2009, month=5, day=20), 2],
        [dt.datetime(year=2001, month=5, day=20), 2],
        [dt.datetime(year=2023, month=12, day=30), 3],
        [dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]

df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
        date  Value
0 2022-03-01      1
1 2009-05-20      2
2 2001-05-20      2
3 2023-12-30      3

df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
        date  Value
0 2022-03-01      1
1 2009-05-20      2
3 2100-01-01      3

If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
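For completeness, the workaround described in that link boils down to passing errors='coerce', which maps out-of-bounds dates to NaT instead of raising (a sketch):

```python
import pandas as pd

# out-of-bounds dates become NaT rather than raising OutOfBoundsDatetime
s = pd.to_datetime(pd.Series(['0006-03-01 00:00:00',
                              '2009-05-20 00:00:00']),
                   errors='coerce')
print(s.isna().tolist())  # [True, False]
```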

L.Clarkson

I suggest the following:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
                                      '7199-05-15 00:00:00',
                                      '2020-10-21 00:00:00'],
                             'value': [1, 2, 3]})

df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023' \
              else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]

df.dropna(inplace=True)

This yields the desired output:

date        value
01.01.2100  2
21.10.2020  3
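The same slicing logic can also be written with vectorized string accessors instead of a list comprehension (a sketch, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['0003-03-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '2020-10-21 00:00:00'],
                   'value': [1, 2, 3]})

# year as an integer, taken from the first four characters
year = df['date'].str[:4].astype(int)

# reorder to dd.mm.yyyy, or 01.01.2100 for years beyond 2022
df['date'] = np.where(year > 2022, '01.01.2100',
                      df['date'].str[8:10] + '.'
                      + df['date'].str[5:7] + '.'
                      + df['date'].str[:4])

# drop rows with years before 2005
df = df[year >= 2005].reset_index(drop=True)
print(df)
```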
Michael Hodel