5

I am trying to convert a datetime object to datetime. In the original dataframe the data type is a string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:

                       DATE
Out[3]    0           081972
          1           051967
          2           101964
          3           041975
          4           071976

I tried:

df['DATE'].apply(pd.to_datetime(format='%m%Y'))

and

pd.to_datetime(df['DATE'],format='%m%Y')

I got Runtime Error both times

Then

df['DATE'].apply(pd.to_datetime)

it worked for the other not shown columns(with DDMMYYYY format), but generated future dates with df['DATE'] because it reads the dates as MMDDYY instead of MMYYYY.

            DATE
0       1972-08-19
1       2067-05-19
2       2064-10-19
3       1975-04-19
4       1976-07-19

Expect output:

          DATE
0       1972-08
1       1967-05
2       1964-10
3       1975-04
4       1976-07

If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.

Thank you all in advance for your help

Simone Di Claudio
  • 131
  • 1
  • 2
  • 8

2 Answers2

11

First if error is raised obviously some datetimes not match, you can test it by errors='coerce' parameter and Series.isna, because for not matched values are returned missing values:

print (df)
     DATE
0   81972
1   51967
2  101964
3   41975
4  171976 <-changed data


print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0   1972-08-01
1   1967-05-01
2   1964-10-01
3   1975-04-01
4          NaT
Name: DATE, dtype: datetime64[ns]

print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
     DATE
4  171976

Solution with output from changed data with converting to datetimes and the to months periods by Series.dt.to_period:

df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)
      DATE
0  1972-08
1  1967-05
2  1964-10
3  1975-04
4      NaT

Solution with original data:

df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)

0  1972-08
1  1967-05
2  1964-10
3  1975-04
4  1976-07
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

I would have done:

df['date_formatted'] = pd.to_datetime(
    dict(
        year=df['DATE'].str[2:], 
        month=df['DATE'].str[:2], 
        day=1
    )
)

Maybe this helps. Works for your sample data.

Asclepius
  • 57,944
  • 17
  • 167
  • 143
naccode
  • 510
  • 1
  • 8
  • 18