
Below is a sample of my df

date                   value

0006-03-01 00:00:00    1   
0006-03-15 00:00:00    2   
0006-05-15 00:00:00    1   
0006-07-01 00:00:00    3   
0006-11-01 00:00:00    1   
2009-05-20 00:00:00    2   
2009-05-25 00:00:00    8   
2020-06-24 00:00:00    1   
2020-06-30 00:00:00    2   
2020-07-01 00:00:00    13  
2020-07-15 00:00:00    2   
2020-08-01 00:00:00    4   
2020-10-01 00:00:00    2   
2020-11-01 00:00:00    4    
2023-04-01 00:00:00    1   
2218-11-12 10:00:27    1   
4000-01-01 00:00:00    6 
5492-04-15 00:00:00    1    
5496-03-15 00:00:00    1    
5589-12-01 00:00:00    1    
7199-05-15 00:00:00    1    
9186-12-30 00:00:00    1  

As you can see, the data contains some erroneous dates.

Questions:

  • How can we convert this column to the format dd.mm.yyyy?
  • How can we replace dates where the year is greater than 2022 with 01.01.2100?
  • How can we remove all rows where the year is less than 2005?

The final output should look like this:

date          value

20.05.2009    2
25.05.2009    8
24.06.2020    1
30.06.2020    2
01.07.2020    13
15.07.2020    2
01.08.2020    4
01.10.2020    2
01.11.2020    4
01.01.2100    1
01.01.2100    1
01.01.2100    6
01.01.2100    1
01.01.2100    1
01.01.2100    1
01.01.2100    1
01.01.2100    1

I tried to convert the column using to_datetime, but it failed with:

df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

Out of bounds nanosecond timestamp: 5-03-01 00:00:00

Thanks to anyone helping!

A2N15
  • The out of bounds error is because Pandas represents timestamps in nanosecond resolution: https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a – L.Clarkson May 16 '20 at 22:46
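For reference, the nanosecond limitation mentioned in the comment can be checked directly; pandas stores datetimes as 64-bit nanosecond counts, which only covers roughly the years 1677 to 2262:

```python
import pandas as pd

# the representable range of a nanosecond-resolution Timestamp
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```

Any date outside this window (like 0006-03-01 or 9186-12-30 above) raises OutOfBoundsDatetime.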

3 Answers


You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small values like '0006', calling pd.to_datetime with errors='coerce' will do the trick: it leaves NaT for the invalid dates, which you can then drop with dropna(). Example:

import pandas as pd

df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
                            '0006-03-15 00:00:00',
                            '0006-05-15 00:00:00',
                            '0006-07-01 00:00:00',
                            '0006-11-01 00:00:00',
                            'nan',
                            '2009-05-25 00:00:00',
                            '2020-06-24 00:00:00',
                            '2020-06-30 00:00:00',
                            '2020-07-01 00:00:00',
                            '2020-07-15 00:00:00',
                            '2020-08-01 00:00:00',
                            '2020-10-01 00:00:00',
                            '2020-11-01 00:00:00',
                            '2023-04-01 00:00:00',
                            '2218-11-12 10:00:27',
                            '4000-01-01 00:00:00',
                            'NaN',
                            '5496-03-15 00:00:00',
                            '5589-12-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '9186-12-30 00:00:00']})

# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]

# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'

# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0           NaT
# 1           NaT
# 2           NaT
# 3           NaT
# 4           NaT
# 6    2009-05-25
# 7    2020-06-24
# ...

df = df.dropna()
# df
#          date
# 6  2009-05-25
# 7  2020-06-24
# 8  2020-06-30
# 9  2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
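To get the dd.mm.yyyy string format asked for, dt.strftime can be applied after the conversion (a sketch; note this turns the column back into plain strings, so do it as the last step):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2009-05-25', '2100-01-01']))
# format datetimes as 'dd.mm.yyyy' strings (dtype becomes object)
print(s.dt.strftime('%d.%m.%Y').tolist())  # ['25.05.2009', '01.01.2100']
```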
FObersteiner
  • Thanks @MrFuppes! It's doing the job. Besides, I was wondering how I can adjust your code if the DF contains NaN (without removing them), because when applying your code it gives me the following error: invalid literal for int() with base 10: 'nan' – A2N15 May 18 '20 at 08:03
  • @Annis15 You mean the 'date' column initially contains strings 'nan'? – FObersteiner May 18 '20 at 08:30
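One way to handle that case (a sketch, not from the answer itself): pd.to_numeric with errors='coerce' turns the 'nan' year strings into NaN instead of raising the int() error, so the rows can be kept:

```python
import pandas as pd

df = pd.DataFrame({'date': ['nan',
                            '2023-04-01 00:00:00',
                            '2009-05-25 00:00:00']})

# extract the year part; 'nan' becomes NaN instead of raising on astype(int)
years = pd.to_numeric(df['date'].str.split('-').str[0], errors='coerce')
df.loc[years > 2022, 'date'] = '2100-01-01 00:00:00'
print(df['date'].tolist())
# ['nan', '2100-01-01 00:00:00', '2009-05-25 00:00:00']
```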

The out of bounds error is thrown due to pandas' nanosecond timestamp limitations (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code removes values that would cause the error before creating the DataFrame.

import datetime as dt

import pandas as pd

data = [[dt.datetime(year=2022, month=3, day=1), 1],
        [dt.datetime(year=2009, month=5, day=20), 2],
        [dt.datetime(year=2001, month=5, day=20), 2],
        [dt.datetime(year=2023, month=12, day=30), 3],
        [dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]

df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
        date  Value
0 2022-03-01      1
1 2009-05-20      2
2 2001-05-20      2
3 2023-12-30      3

df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
        date  Value
0 2022-03-01      1
1 2009-05-20      2
3 2100-01-01      3

If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
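For completeness, the workaround described in that link boils down to passing errors='coerce', which maps out-of-bounds dates to NaT instead of raising (a sketch):

```python
import pandas as pd

# out-of-bounds dates become NaT rather than raising OutOfBoundsDatetime
s = pd.to_datetime(pd.Series(['0006-03-01 00:00:00',
                              '2009-05-20 00:00:00']),
                   errors='coerce')
print(s.isna().tolist())  # [True, False]
```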

L.Clarkson

I suggest the following:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
                                      '7199-05-15 00:00:00',
                                      '2020-10-21 00:00:00'],
                             'value': [1, 2, 3]})

df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023' \
              else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]

df.dropna(inplace=True)

This yields the desired output:

date        value
01.01.2100  2
21.10.2020  3
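The same slicing logic can also be written with vectorized string accessors instead of a list comprehension (a sketch, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['0003-03-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '2020-10-21 00:00:00'],
                   'value': [1, 2, 3]})

# year as an integer, taken from the first four characters
year = df['date'].str[:4].astype(int)

# reorder to dd.mm.yyyy, or 01.01.2100 for years beyond 2022
df['date'] = np.where(year > 2022, '01.01.2100',
                      df['date'].str[8:10] + '.'
                      + df['date'].str[5:7] + '.'
                      + df['date'].str[:4])

# drop rows with years before 2005
df = df[year >= 2005].reset_index(drop=True)
print(df)
```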
Michael Hodel