I have a huge data with the following columns
|CategoryX|CategoryY|Time
|X a |Y a |2017-10-01 17:30:00
|X a |Y b |2017-10-09 12:30:00
|X b |Y a |2017-10-03 18:30:00
|X b |Y a |2017-10-05 16:30:00
|X b |Y a |2017-10-06 18:30:00
|X c |Y a |2017-09-01 17:25:00
|X c |Y b |2017-10-10 13:30:00
|X c |Y c |2017-10-01 11:30:00
|X c |Y c |2017-10-11 11:30:00
By doing
df=pd.DataFrame({'X':df['CategoryX'], 'Y':df['CategoryY'],'Time':df['Time']})
df=df.groupby(['X','Y'],as_index=False).size().reset_index(name="Count")
df = df.sort_values(['X', 'Count'], ascending=[1,0]).reset_index(drop=True)
The data is sorted into
|CategoryX|CategoryY|Count
|X a |Y a |1
|X a |Y b |1
|X b |Y a |3
|X c |Y a |1
|X c |Y b |1
|X c |Y c |2
However I need to calculate the frequency in the time column by
(Latest time - Earliest time)/Total count
If only one count then display the time, what am I missing for this to appear?
|CategoryX|CategoryY|Count|Frequency
|X a |Y a |1 |2017-10-01 17:30:00
|X a |Y b |1 |2017-10-09 12:30:00
|X b |Y a |3 |1 Day
|X c |Y a |1 |2017-09-01 17:25:00
|X c |Y b |1 |2017-10-10 13:30:00
|X c |Y c |2 |5 Days