I have a dataframe like the one below:

cust_id,purchase_date
   1,10/01/1998
   1,10/12/1999
   2,13/05/2016
   3,14/02/2018
   3,15/03/2019

I would like to do the following:

a) Display the time between each customer's first and last purchase as text, e.g. 5 years and 9 months instead of 5.93244.

I tried the following:

from datetime import timedelta

import numpy as np
import pandas as pd

df['purchase_date'] = pd.to_datetime(df['purchase_date'])
gb = df.groupby(['cust_id'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()
df_cust_age['diff_in_days'] = df_cust_age['max_date'] - df_cust_age['min_date']
df_cust_age['years_diff'] = df_cust_age['diff_in_days']/timedelta(days=365)

But the above code gives the output as decimal numbers.

I expect my output to look like this:

cust_id,years_diff
  1, 1 years and 11 months and 0 day
  2, 0 years
  3, 1 year and 1 month and 1 day
The Great
  • This is not really a task that can be performed unambiguously, which is why pandas does not do it by default (while it does for days). There is no exact definition of a year or a month in terms of seconds (a year can be 365 or 366 days, and month lengths vary too). You need to define precisely how you want to handle these ambiguities. – mozway Jun 02 '22 at 07:29
  • Can pandas give just years and months? I don't need days... – The Great Jun 02 '22 at 07:30
  • As explained above, not unambiguously; **30 days** can be **1 month minus 1 day**, **exactly 1 month**, **1 month and 1 day**, or **1 month and 2 days** – mozway Jun 02 '22 at 07:32
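The ambiguity raised in the comments can be made concrete with a small standard-library sketch. The helper `months_and_days` is hypothetical (it is not part of pandas or of either answer) and assumes `start <= end` and a start day no later than the 28th. It counts whole calendar months and leftover days, and shows that the same 30-day span maps to different month/day pairs depending on where it starts.

```python
from datetime import date, timedelta


def months_and_days(start, end):
    """Whole calendar months from start to end, plus leftover days.

    Hypothetical helper: assumes start <= end and start.day <= 28,
    so shifting by whole months never produces an invalid date.
    """
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not complete yet
    # Shift the start date forward by the whole months counted above.
    year_shift, month_index = divmod(start.month - 1 + months, 12)
    anchor = date(start.year + year_shift, month_index + 1, start.day)
    return months, (end - anchor).days


thirty = timedelta(days=30)
for start in (date(2022, 1, 1), date(2022, 4, 1), date(2024, 2, 1), date(2022, 2, 1)):
    print(start, months_and_days(start, start + thirty))
# 2022-01-01 (0, 30)  one day short of a full 31-day month
# 2022-04-01 (1, 0)   exactly one month
# 2024-02-01 (1, 1)   leap-year February
# 2022-02-01 (1, 2)   28-day February
```

Any text output in years and months therefore depends on a convention: either calendar-aware arithmetic on the two dates, or fixed-length "default" years and months applied to the day count.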

2 Answers

If it is acceptable to treat a 'default' month as 30 days (and a year as 365 days), use this custom function:

#https://stackoverflow.com/a/13756038/2901002

def td_format(td_object):
    seconds = int(td_object.total_seconds())
    periods = [
        ('year',        60*60*24*365),
        ('month',       60*60*24*30),
        ('day',         60*60*24),
        ('hour',        60*60),
        ('minute',      60),
        ('second',      1)
    ]

    strings=[]
    for period_name, period_seconds in periods:
        # >= so that an exact multiple (e.g. exactly 365 days) counts as 1 year
        if seconds >= period_seconds:
            period_value , seconds = divmod(seconds, period_seconds)
            has_s = 's' if period_value > 1 else ''
            strings.append("%s %s%s" % (period_value, period_name, has_s))

    return ", ".join(strings) if len(strings) > 0 else '0 year'


df_cust_age['years_diff'] = df_cust_age['diff_in_days'].apply(td_format)
print (df_cust_age)
   cust_id   min_date   max_date diff_in_days       years_diff
0        1 1998-10-01 1999-10-12     376 days  1 year, 11 days
1        2 2016-05-13 2016-05-13       0 days           0 year
2        3 2018-02-14 2019-03-15     394 days  1 year, 29 days
jezrael
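Since the asker mentioned in the comments that days are not needed, the same fixed-length convention (365-day year, 30-day month) can be reduced to years and months only. The following is a minimal sketch under that assumption. The names `plural` and `years_months` are hypothetical, and capping the month count at 11 is a design choice made here so that a remainder of 360 to 364 days does not print as "12 months".

```python
from datetime import timedelta


def plural(n, unit):
    """Return e.g. '1 year' or '2 years'."""
    return f"{n} {unit}" if n == 1 else f"{n} {unit}s"


def years_months(td, days_per_year=365, days_per_month=30):
    """Format a non-negative timedelta as years and months only.

    Assumes fixed-length years and months; leftover days are dropped.
    """
    years, rest = divmod(td.days, days_per_year)
    # Cap at 11 so 360-364 leftover days do not show up as 12 months.
    months = min(rest // days_per_month, 11)
    return f"{plural(years, 'year')} and {plural(months, 'month')}"


print(years_months(timedelta(days=376)))  # 1 year and 0 months
print(years_months(timedelta(days=700)))  # 1 year and 11 months
print(years_months(timedelta(days=0)))    # 0 years and 0 months
```

Because `pandas.Timedelta` is a subclass of `datetime.timedelta`, a function like this should also work with `Series.apply` on a timedelta column such as `diff_in_days`.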
from io import StringIO
import numpy as np  # needed for np.min / np.max in the aggregation below
import pandas as pd
from dateutil.relativedelta import relativedelta as RD


string_data = '''unique_key,purchase_date
   1,10/01/1998
   1,10/12/1999
   2,13/05/2016
   3,14/02/2018
   3,15/03/2019'''



## Custom functions
diff_obj = lambda d1,d2:RD(d1, d2) if d1>d2 else RD(d2, d1)
date_tuple = lambda diff:(diff.years,diff.months,diff.days)
pipeline = lambda row:date_tuple(diff_obj(row['min_date'],row['max_date']))

def string_format(date_tuple):
    final_string = []
    for val,name in zip(date_tuple,['years','months','day']):
        if val:
            final_string.append(f'{val} {name}')
    return ' and '.join(final_string) if final_string else '0 years'
## Custom functions


df = pd.read_csv(StringIO(string_data))
df['purchase_date'] = pd.to_datetime(df['purchase_date'],format='%d/%m/%Y')

gb = df.groupby(['unique_key'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()


df_cust_age['years_diff'] = df_cust_age.apply(pipeline,axis=1).apply(string_format)

print(df_cust_age)
    unique_key  min_date    max_date    years_diff
0   1           1998-01-10  1999-12-10  1 years and 11 months
1   2           2016-05-13  2016-05-13  0 years
2   3           2018-02-14  2019-03-15  1 years and 1 months and 1 day
Mazhar
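The output above uses plural unit names even for a value of 1 ("1 years", "1 months"), while the expected output in the question uses "1 year and 1 month and 1 day". A minimal sketch of a formatter that picks singular or plural per component is shown below. The name `format_ymd` is hypothetical, and it takes plain integers such as the `(years, months, days)` tuple produced by `date_tuple`.

```python
def format_ymd(years, months, days):
    """Join the non-zero components with ' and ', using singular for 1."""
    parts = []
    for value, unit in ((years, "year"), (months, "month"), (days, "day")):
        if value:
            suffix = "" if value == 1 else "s"
            parts.append(f"{value} {unit}{suffix}")
    # Same fallback text as the answer's string_format for a zero difference.
    return " and ".join(parts) if parts else "0 years"


print(format_ymd(1, 11, 0))  # 1 year and 11 months
print(format_ymd(0, 0, 0))   # 0 years
print(format_ymd(1, 1, 1))   # 1 year and 1 month and 1 day
```

With a tuple in hand, it could be called as `format_ymd(*t)` in place of `string_format(t)`.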