49

I think this should be simple, but the techniques I've seen involve iterating over a dataframe's date fields to determine the difference between two dates, and I'm having trouble with that. I'm familiar with MSSQL DATEDIFF, so I thought Pandas datetime would have something similar. Perhaps it does and I'm missing it.

Is there a Pandonic way of determining the number of months as an integer between two dates (datetime) without the need to iterate? Keep in mind that there are potentially millions of rows, so performance is a consideration.

The dates are datetime objects and the result would look like this, with Months being the new column:

Date1           Date2         Months
2016-04-07      2017-02-01    11
2017-02-01      2017-03-05    1
FObersteiner
shavar

6 Answers

102

Here is a very simple answer my friend:

df['nb_months'] = ((df.date2 - df.date1)/np.timedelta64(1, 'M'))

and now:

df['nb_months'] = df['nb_months'].astype(int)
ℕʘʘḆḽḘ
  • I get a float when running this. Is that what you expected? – shavar Mar 15 '17 at 23:56
  • 2
    just convert to integer with astype('int') bro – ℕʘʘḆḽḘ Mar 15 '17 at 23:56
  • 4
    df['month'] = ((df.date2 - df.date1) / np.timedelta64(1, 'M')).astype(int) does the trick. Completes quickly. Thanks bro. – shavar Mar 16 '17 at 00:01
  • not better bro, just different. the advantage of my solution is that the output is a float, not an object. – ℕʘʘḆḽḘ Jan 03 '18 at 20:36
  • 2
    I concur with discort. The other solution is better as it takes care of rounding. The .astype method suggested here fails with NaT rows (which you may get if you had just calculated a 'Next Date' field where the last row is always a NaT) – Reddspark Jan 21 '18 at 20:55
  • 10
    Beware: this rounds to, e.g., 0 months between Feb 1 and March 1 -- is that what you really want? It gives slightly more or less than a whole number of months depending on the months in question. For instance, `(pd.Timestamp('2018-03-01') - pd.Timestamp('2018-02-01')) / np.timedelta64(1, 'M') == 0.91993675`. @piRSquared's solution, or `.round()` is probably better. – Doctor J Sep 21 '18 at 23:28
  • If you read the comments you would notice I suggest to use `astype(int)` which completely solves this point. Adding this more explicitly. – ℕʘʘḆḽḘ Sep 23 '18 at 12:10
  • Just note that this returns 9 months for the first row, instead of OP's desired 11. I'd say that's correct - the interval is almost but not quite 10 months, and rounding down is usually what you want for timedeltas - but thought the discrepancy worth pointing out. – Mark Reed Sep 26 '18 at 17:06
  • You will face an issue with rounding with this method. Use df['nb_months'] = df['nb_months'].round(0).astype(int) – Nicolas Oct 22 '18 at 13:50
  • I prefer this https://stackoverflow.com/questions/49700898/pandas-date-range-subtracting-numpy-timedelta-gives-odd-result-time-becomes-n – TMrtSmith Feb 04 '19 at 12:28
  • 3
    Assuming you are running Python 3, you could use the // operator to do integer division to get the integer `df['nb_months'] = (df.date2 - df.date1) // np.timedelta64(1, 'M')` – Dr Fabio Gori Mar 04 '19 at 00:04
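
The comments above disagree on truncating versus rounding. Here is a minimal sketch of the difference, not the answerer's exact code. It assumes the question's column names (Date1, Date2) plus a third illustrative row taken from Doctor J's comment. Because the 'M' unit does not denote a fixed duration, the sketch spells out an average month length explicitly (AVG_MONTH is a name made up for this example):

```python
import pandas as pd

# Sample data: the two rows from the question plus the Feb 1 -> Mar 1 case
# raised in the comments.
df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01", "2018-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05", "2018-03-01"]),
})

# Assumption: one "month" is the mean Gregorian month, 365.2425 / 12 days.
AVG_MONTH = pd.Timedelta(days=365.2425 / 12)

# Vectorized: timedelta Series divided by a fixed Timedelta gives floats.
months_float = (df["Date2"] - df["Date1"]) / AVG_MONTH

df["truncated"] = months_float.astype(int)        # drops the fraction
df["rounded"] = months_float.round().astype(int)  # nearest whole month

print(df)
```

With this data the truncated column comes out as 9, 1, 0 and the rounded column as 10, 1, 1, so the Feb 1 to Mar 1 row is where the two choices disagree most visibly. Neither line handles NaT; rows with missing dates would need to be dropped or filled first.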
44

An alternative, possibly more elegant solution is df.Date2.dt.to_period('M') - df.Date1.dt.to_period('M'), which avoids rounding errors.

pberkes
  • 3
    I think this is the more correct answer, as rounding errors sure cause trouble. – Nils Apr 07 '19 at 13:26
  • 5
    to return a Series of int, use the following code: `from operator import attrgetter` `(df.Date2.dt.to_period('M') - df.Date1.dt.to_period('M')).apply(attrgetter('n'))` as per [this post](https://stackoverflow.com/a/54465409) – h2ku Jan 08 '20 at 05:00
  • 2
    Doesn't work for pandas version > 0.24.0. See [this](https://stackoverflow.com/questions/40923820/pandas-timedelta-in-months) answer for updated code. – gherka Feb 06 '20 at 12:40
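
Putting the period approach and the attrgetter comment together, a minimal sketch might look like this. It assumes a pandas version (0.24 or later) in which subtracting two period Series yields offset objects such as MonthEnd rather than plain integers, and it reuses the question's sample data:

```python
from operator import attrgetter

import pandas as pd

df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05"]),
})

# Convert each date to its calendar month, then subtract.
# The result holds offset objects (e.g. <10 * MonthEnds>), not integers.
diff = df["Date2"].dt.to_period("M") - df["Date1"].dt.to_period("M")

# Pull the integer count out of each offset via its .n attribute.
df["Months"] = diff.apply(attrgetter("n"))

print(df)
```

This counts calendar-month boundaries, so the first row gives 10 and the second gives 1. The apply step runs a Python-level call per row, which may matter with millions of rows.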
27
df.assign(
    Months=
    (df.Date2.dt.year - df.Date1.dt.year) * 12 +
    (df.Date2.dt.month - df.Date1.dt.month)
)

       Date1      Date2  Months
0 2016-04-07 2017-02-01      10
1 2017-02-01 2017-03-05       1
piRSquared
  • This one works as well. I get 10 and 1 months. With @Noobie solution I get 9 and 1. So this one is inclusive. Depending on my needs for any particular project both are super useful. Thanks. – shavar Mar 16 '17 at 00:18
  • 2
    or simply: df["Months"] = (df.Date2.dt.year - df.Date1.dt.year) * 12 + (df.Date2.dt.month - df.Date1.dt.month) – Nicolas Oct 22 '18 at 13:51
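
The 10-versus-9 discrepancy mentioned in the comments can be reproduced with the same year/month arithmetic. The sketch below is one possible variant, not part of the answer: it assumes Date2 is never earlier than Date1, and the column names MonthsCrossed and MonthsCompleted are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05"]),
})

# Calendar-month boundaries crossed (same formula as the answer above).
crossed = (
    (df["Date2"].dt.year - df["Date1"].dt.year) * 12
    + (df["Date2"].dt.month - df["Date1"].dt.month)
)

# Assumption: the last month only counts as complete if Date2's
# day-of-month has reached Date1's day-of-month.
incomplete = df["Date2"].dt.day < df["Date1"].dt.day

df["MonthsCrossed"] = crossed
df["MonthsCompleted"] = crossed - incomplete.astype(int)

print(df)
```

For the question's rows this gives 10 and 1 boundaries crossed, and 9 and 1 completed months. Month-end dates are a judgment call under this rule: Jan 31 to Feb 28 counts as zero completed months.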
8

This works with pandas 1.1.1:

df['Months'] = df['Date2'].dt.to_period('M').astype(int) - df['Date1'].dt.to_period('M').astype(int)

df

# Out[11]: 
#        Date1      Date2  Months
# 0 2016-04-07 2017-02-01      10
# 1 2017-02-01 2017-03-05       1
Pawel Kranzberg
7

Just a small addition to @pberkes answer. In case you want the answer as integer values and NOT as pandas._libs.tslibs.offsets.MonthEnd, just append .n to the above code.

(pd.to_datetime('today').to_period('M') - pd.to_datetime('2020-01-01').to_period('M')).n
# [Out]:
# 7
aks
4

There are two notions of difference in time, which are both correct in a certain sense. Let us compare the difference in months between July 31 and September 01:

import numpy as np
import pandas as pd

dtr = pd.date_range(start="2016-07-31", end="2016-09-01", freq="D")
delta1 = int((dtr[-1] - dtr[0])/np.timedelta64(1,'M'))
delta2 = (dtr[-1].to_period('M') - dtr[0].to_period('M')).n
print(delta1,delta2)

Using numpy's timedelta, delta1=1, which is correct given that there is only one month in between, but delta2=2, which is also correct given that September is still two months away in July. In most cases, both will give the same answer, but one might be more correct than the other given the context.

Nils