4

Is there an equivalent to SQL's datediff function in Python's pandas? The answer to this question: Add column with number of days between dates in DataFrame pandas explains how to calculate the difference in days. For example:

>>> (pd.to_datetime('15-10-2010') - pd.to_datetime('15-07-2010')) / pd.offsets.Day(1)
92.0

However, I have two questions:

  1. Is there a way to calculate the difference in months? I can approximate dividing the result above by 30, by 31, but I was wondering if there is some built-in function that does this automatically.
  2. what is the syntax of pd.offsets? I tried dividing by pd.offsets.Month(1) and it doesn't work. I looked up the documentation here (which is atrocious, like all of Python's documentation!): http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects but Day() is not there, so I'm confused
Community
  • 1
  • 1
Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112
  • Also, I cannot seem to convert the result into an integer. Any ideas? – Pythonista anonymous May 06 '16 at 18:03
  • Mmm, I found here http://stackoverflow.com/questions/22132525/add-column-with-number-of-days-between-dates-in-dataframe-pandas how to convert from timedelta to float: divide by np.timedelta64(1,'D'). Why, and where it is documented, are separate questions!!! – Pythonista anonymous May 06 '16 at 18:09

3 Answers3

3

UPDATE:

def months_between(d1, d2):
    dd1 = min(d1, d2)
    dd2 = max(d1, d2)
    return (dd2.year - dd1.year)*12 + dd2.month - dd1.month

In [125]: months_between(pd.to_datetime('2015-01-02 12:13:14'), pd.to_datetime('2012-03-02 12:13:14'))
Out[125]: 34

OLD answer:

In [40]: (pd.to_datetime('15-10-2010') - pd.to_datetime('15-07-2010')).days
Out[40]: 92

you can also do this for months:

In [48]: pd.to_datetime('15-10-2010').month - pd.to_datetime('15-07-2010').month
Out[48]: 3
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
3

If you look around a little, it seems that months is not possible to get out of a TimeDelta:

In [193]: date_1 = pd.to_datetime('2015-01-02 12:13:14')

In [194]: date_2 = pd.to_datetime('2012-03-02 12:13:14')

In [195]: date_1 - date_2
Out[195]: Timedelta('1036 days 00:00:00')

In [199]: td_1.
td_1.asm8            td_1.days            td_1.freq            td_1.microseconds    td_1.resolution      td_1.to_pytimedelta  td_1.value           
td_1.ceil            td_1.delta           td_1.is_populated    td_1.min             td_1.round           td_1.to_timedelta64  td_1.view            
td_1.components      td_1.floor           td_1.max             td_1.nanoseconds     td_1.seconds         td_1.total_seconds

In [199]: td_1.components
Out[199]: Components(days=1036, hours=0, minutes=0, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)

Additionally, Components are not offering different denominations of the same value seemingly, but

In [213]: td_1.components.days
Out[213]: 1036

In [214]: td_1.components.hours
Out[214]: 0

Ultimately, it seems that what you have been doing until now seems like the "best" solution:

In [214]: td_1.components.days/30
Out[214]: 34.53333333333333

In [215]: np.round(td_1.components.days/30)
Out[215]: 35.0

In [216]: np.floor(td_1.components.days/30)
Out[216]: 34.0

Not the great news really, but a solution in any case.

As to comparing the documentation that Matlab comes with to this of pandas, you are right. However, if you were to compare the price tag of the two as well maybe some questions are answered.. (?)

Thanos
  • 2,472
  • 1
  • 16
  • 33
  • 1
    Sure, but my point is that, in a business environment, time is money, and the 'free'price of a poorly documented product is a false economy. I have just wasted about an hour trying to figure out how to subtract dates. If there existed a commercial software with roughly the functionalities of pandas and the quality of documentation of Matlab, i'd buy it in a heartbeat – Pythonista anonymous May 06 '16 at 18:11
0

There is also pd.Period, which you can use to get the difference for any time period.

In [1]: d1 = pd.Period('2022-08-31', 'M')
        d1
Out[1]: Period('2022-08', 'M')

In [2]: d2 = pd.Period('2020-03-25', 'M')
        d2
Out[2]: Period('2020-03', 'M')

In [3]: diff = d1 - d2
        diff
Out[3]: <29 * MonthEnds>

In [4]: diff.n
Out[4]: 29

Here's a generic function that you can use with apply, to turn any date column into your desired period:

def keep_period(x, period='M'):

    if pd.isna(x):
        return x
    else:
        return pd.to_datetime(x).to_period(period)

and to call it:

df['date_column_period'] = df['date_column'].apply(keep_period, args=('Y', ))
firefly
  • 301
  • 2
  • 7