I have just read this question: In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user')
shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more important, how is the result not a unique list of 'user' values? I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.