11

Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:

df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')

This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.

Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?

Mtd240
  • 115
  • 1
  • 1
  • 7
  • Have you tried to parse it while importing your data? For instance if you read a csv with `read_csv` in pandas, you have an argument which is `parse_dates = [your_col_name]` – ysearka Jul 20 '17 at 13:12

2 Answers2

12

pass errors='coerce' and then dropna the NaT rows:

df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce').dropna()

The duff month values will get converted to NaT values

In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')

Out[36]: NaT

Alternatively you could filter them out before the conversion

df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13','YYYYMM'], format='%Y%m', errors='coerce')

although this could lead to alignment issues as the returned Series needs to be the same length so just passing errors='coerce' is a simpler solution

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Awesome thanks. It looks like the `.dropna()` didn't actually drop the `NaT` rows for me, but I think I can figure that part out. – Mtd240 Jul 20 '17 at 13:09
  • You may need to do that as a separate step, so after assignment then call `dropna()` – EdChum Jul 20 '17 at 13:10
0

Clean up the dataframe first.

df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')]
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'])

May I suggest turning the column into a period index if YYYYMM column is unique in your dataset.

First turn YYYYMM into index, then convert it to monthly period.

df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')
frogcoder
  • 963
  • 1
  • 7
  • 17