11

In my pandas dataframe I want to find the difference between dates in months. The function .dt.to_period('M') results in a MonthEnd object like <11 * MonthEnds> instead of the month number.

I tried to change the column type with pd.to_numeric() and to remove the letters with re.sub("[^0-9]", "", 'blablabla123bla'). Both do not work on a MonthEnd object.

df['duration_dataset'] = df['date_1'].dt.to_period('M') - df['date_2'].dt.to_period('M')

I expected 11, but the output is <11 * MonthEnds>.

Here is a minimum dataframe

d = {'date_1': ['2018-03-31','2018-09-30'], 'date_2': ['2017-12-31','2017-12-31']}
df = pd.DataFrame(data=d)

df['date_1'] = pd.to_datetime(df['date_1'], format='%Y-%m-%d')
df['date_2'] = pd.to_datetime(df['date_2'], format='%Y-%m-%d')

df['duration_dataset'] = df['date_1'].dt.to_period('M') - df['date_2'].dt.to_period('M')

df
Inge
  • 135
  • 1
  • 6
  • I find that behavior, pandas 0.24.0 – ALollz Jan 31 '19 at 16:39
  • 1
    You appear to have [`pd.MonthEnd()` objects](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.MonthEnd.html#pandas.tseries.offsets.MonthEnd), so these are [date offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects). – Martijn Pieters Jan 31 '19 at 16:40
  • Can you please create a minimal dataframe that reproduces the issue? Include code to produce a frame with a few rows to show how the `MonthEnd` instances are created, so we can help correct that and get you the integer month count instead. – Martijn Pieters Jan 31 '19 at 16:42
  • Ah, I was testing with Pandas 0.23.4, upgrading now. – Martijn Pieters Jan 31 '19 at 16:44
  • I have Pandas version 0.24.0 – Inge Jan 31 '19 at 17:00
  • You could try to access the `MonthEnd` object by using `.__dict__['n']` – alphazeta Mar 30 '20 at 14:31

3 Answers3

19

This is new behaviour in Pandas 0.24, where subtracting Period() objects give you a DateOffset subclass.

You can get the numeric value from the DateOffset.n attribute:

from operator import attrgetter

df['duration_dataset'] = (
    df['date_1'].dt.to_period('M') -
    df['date_2'].dt.to_period('M')).apply(attrgetter('n'))

This produces

      date_1     date_2  duration_dataset
0 2018-03-31 2017-12-31                 3
1 2018-09-30 2017-12-31                 9

for your sample dataframe.

Rather than convert your dates to periods, you could instead convert them to a month count since the year 0, then subtract those numbers:

df['duration_dataset'] = (
    df['date_1'].dt.year * 12 + df['date_1'].dt.month - 1 -
    (df['date_2'].dt.year * 12 + df['date_2'].dt.month - 1)
)

which can be simplified to

df['duration_dataset'] = (
    12 * (df['date_1'].dt.year - df['date_2'].dt.year) +
    df['date_1'].dt.month - df['date_2'].dt.month
)
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • 2
    I've wasted about four hours of my life trying to figure this out. Thank you SOO much! – Sean McCarthy Apr 25 '19 at 20:39
  • There's no need to subtract by `1` when you do `df['duration_dataset'] = ( df['date_1'].dt.year * 12 + df['date_1'].dt.month - 1 - (df['date_2'].dt.year * 12 + df['date_2'].dt.month - 1) )`, right? – titusAdam Mar 24 '20 at 11:18
  • @titusAdam: in the direct subtraction here, the `-1` on both side of the equation can be eliminated, yes: `df['date_1'].dt.year * 12 + df['date_1'].dt.month - df['date_2'].dt.year * 12 - df['date_2'].dt.month`. – Martijn Pieters Mar 24 '20 at 11:26
8

You can also coerce Period to int using astype('int') and directly take the difference:

df['duration_dataset'] = df['date_1'].dt.to_period('M').astype('int') - df['date_2'].dt.to_period('M').astype('int')

It seems to be faster.

L. Francis Cong
  • 319
  • 2
  • 8
  • This now yields "FutureWarning: The behavior of .astype from period[M] to int32 is deprecated. In a future version, this astype will return exactly the specified dtype instead of int64, and will raise if that conversion overflows." Making something like this necessary: df['duration_dataset'] = (df['date_1'].dt.to_period('M') - df['date_2'].dt.to_period('M')).apply(lambda x: x.n) – John Anderson Nov 03 '22 at 18:20
  • Which version of pandas is that? – L. Francis Cong Nov 04 '22 at 21:03
  • 1.5.0 and 1.5.1 – John Anderson Nov 07 '22 at 17:11
1

Another alternative to access n attribute for the accepted answer for pandas 0.24 is

df['duration_dataset'].apply(lambda x: x.n)
  • 1
    This has problems with NaT – David Davó Jun 21 '22 at 18:25
  • 1
    @David Davó `if pd.notnull(x) else x` can be added. `df['duration_dataset'].apply(lambda x: x.n if pd.notnull(x) else x)` And if someone wanted to take the column from an Object to Int type `df['duration_dataset'].apply(lambda x: x.n if pd.notnull(x) else x).fillna(np.nan).astype('Int64')` – John Anderson Jan 24 '23 at 23:31