I have a large dataset with the following columns:

|CategoryX|CategoryY|Time
|X a      |Y a      |2017-10-01 17:30:00
|X a      |Y b      |2017-10-09 12:30:00
|X b      |Y a      |2017-10-03 18:30:00
|X b      |Y a      |2017-10-05 16:30:00
|X b      |Y a      |2017-10-06 18:30:00
|X c      |Y a      |2017-09-01 17:25:00
|X c      |Y b      |2017-10-10 13:30:00
|X c      |Y c      |2017-10-01 11:30:00
|X c      |Y c      |2017-10-11 11:30:00

By doing

import pandas as pd

df = pd.DataFrame({'X': df['CategoryX'], 'Y': df['CategoryY'], 'Time': df['Time']})
# size() returns a Series indexed by (X, Y); reset_index names the counts column
df = df.groupby(['X', 'Y']).size().reset_index(name='Count')
df = df.sort_values(['X', 'Count'], ascending=[True, False]).reset_index(drop=True)

The data is sorted into

|CategoryX|CategoryY|Count
|X a      |Y a      |1
|X a      |Y b      |1
|X b      |Y a      |3
|X c      |Y a      |1
|X c      |Y b      |1
|X c      |Y c      |2

However, I need to calculate the frequency from the Time column as

(Latest time - Earliest time) / Total count

If a group has only one record, the time itself should be displayed. What am I missing to get the following?

|CategoryX|CategoryY|Count|Frequency
|X a      |Y a      |1    |2017-10-01 17:30:00
|X a      |Y b      |1    |2017-10-09 12:30:00
|X b      |Y a      |3    |1 Day
|X c      |Y a      |1    |2017-09-01 17:25:00
|X c      |Y b      |1    |2017-10-10 13:30:00
|X c      |Y c      |2    |5 Days
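
As a quick check of the formula on a single group, the two `X c` / `Y c` timestamps from the sample data can be plugged in directly. This is only a sketch, and the variable names `times` and `frequency` are made up for illustration.

```python
import pandas as pd

# The two timestamps recorded for the (X c, Y c) group in the sample data.
times = pd.to_datetime(["2017-10-01 11:30:00", "2017-10-11 11:30:00"])

# (latest time - earliest time) / total count
frequency = (times.max() - times.min()) / len(times)
print(frequency)  # 5 days 00:00:00
```

The result is a `Timedelta`, which is why a group with a single record cannot sensibly show a timestamp in the same column.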
  • Why would you want to have units of (time delta) if there is more than one time stamp for a particular X-Y combination, but then display a timestamp (different/incompatible units compared to a time delta) if there is only one time stamp for that X-Y combination? This would be mixing types. – charlesreid1 Oct 01 '17 at 10:08
  • Let's take X c and Y c as an example. The earliest and latest timestamps are 10 days apart, and there are only two records, so the calculation shows one record every 5 days. I hope I am not being too confusing – user3064676 Oct 01 '17 at 10:12
  • But that only makes sense when applied to records where there is always a count of at least 2. If count is only 1, it isn't sensible to apply. It's like trying to draw a line through a set of points - the procedure for how to do that is sensible if there are 2 or more points, but doesn't make sense when there is 1 point. It would make more sense if you require that count > 1. (Or define the frequency as 0 if count == 1. It should not be a timestamp.) – charlesreid1 Oct 01 '17 at 10:21
  • Once you clear up that definition, you can find the [min and max of each group](https://stackoverflow.com/questions/40490994/min-operation-on-nested-groupby-in-pandas) and use that to compute the time range (group max- group min). Dividing that by the group count will then get you your frequency. – charlesreid1 Oct 01 '17 at 10:27
  • Thank you very much for getting back so quickly! Please forgive me, as I have just started out with pandas. I have managed to display the max and min of the data by using this code: `df = df.groupby(['Category', 'Type'], as_index=False).agg({'Reported on':['count','max','min']})` But there was an error when I tried `df['Frequency'] = df['max'] - df['min'] / df['count']`. How do I get the columns for the max, min and count? – user3064676 Oct 01 '17 at 10:43
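
One possible way around the error in the last comment, as a minimal sketch and not a definitive answer: passing a dict with a list to `agg` produces two-level column labels, so `df['max']` does not exist, and `/` binds tighter than `-`, so the subtraction needs parentheses. Named aggregation gives flat column names instead. The output column names `Count`, `Earliest`, `Latest`, and `Frequency` are assumptions chosen for illustration, and the sample rows are copied from the question. Groups with a single record come out as a zero timedelta here, following the suggestion in the comments, not as a timestamp.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "CategoryX": ["X a", "X a", "X b", "X b", "X b", "X c", "X c", "X c", "X c"],
    "CategoryY": ["Y a", "Y b", "Y a", "Y a", "Y a", "Y a", "Y b", "Y c", "Y c"],
    "Time": pd.to_datetime([
        "2017-10-01 17:30:00", "2017-10-09 12:30:00", "2017-10-03 18:30:00",
        "2017-10-05 16:30:00", "2017-10-06 18:30:00", "2017-09-01 17:25:00",
        "2017-10-10 13:30:00", "2017-10-01 11:30:00", "2017-10-11 11:30:00",
    ]),
})

# Named aggregation: each keyword becomes one flat output column.
summary = df.groupby(["CategoryX", "CategoryY"], as_index=False).agg(
    Count=("Time", "count"),
    Earliest=("Time", "min"),
    Latest=("Time", "max"),
)

# Parentheses matter: subtract first, then divide by the group count.
summary["Frequency"] = (summary["Latest"] - summary["Earliest"]) / summary["Count"]

print(summary[["CategoryX", "CategoryY", "Count", "Frequency"]])
```

With the sample data this gives 1 day for `X b` / `Y a` and 5 days for `X c` / `Y c`, matching the expected table, and 0 days for every group with a single record.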

0 Answers