
I have just read this question: In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?

and I am completely baffled by the answer. How does it work?

When I call groupby('user'), shouldn't the result be grouped by user? Whichever function I use (mean, sum, etc.), I would expect a result like this:

import pandas as pd

aa = pd.DataFrame([{'user': 'F', 'time': 0},
                   {'user': 'T', 'time': 0},
                   {'user': 'T', 'time': 0},
                   {'user': 'T', 'time': 1},
                   {'user': 'B', 'time': 1},
                   {'user': 'K', 'time': 2},
                   {'user': 'J', 'time': 2},
                   {'user': 'T', 'time': 3},
                   {'user': 'J', 'time': 4},
                   {'user': 'B', 'time': 4}])
# Aggregate: one summed value per user
aa2 = aa.groupby('user')['time'].sum()
print(aa2)

user
B    5
F    0
J    6
K    2
T    4
Name: time, dtype: int64

How does diff() instead return the difference between each row and the previous row within the same group?

aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
   time user  diff
0     0    F   NaN
1     0    T   NaN
2     0    T   0.0
3     1    T   1.0
4     1    B   NaN
5     2    K   NaN
6     2    J   NaN
7     3    T   2.0
8     4    J   2.0
9     4    B   3.0

More importantly, why is the result not indexed by the unique 'user' values? I found many answers that use groupby.diff(), but none of them explains it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.

AlePorro
    The main difference is that `sum` and `mean` aggregate values (they reduce each group to a single value), whereas `diff` and `cumsum` do not aggregate: they return a `Series` with the same size as the original df. These are two different groups of functions. – jezrael Jun 01 '18 at 09:26
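To make the distinction in the comment concrete, here is a minimal sketch using the same data as the question. The names `grouped`, `agg`, `per_row`, `pieces` and `manual` are made up for illustration, and the hand-rolled loop is only a rough model of what a non-aggregating groupby method appears to do, not a description of how pandas implements it internally.

```python
import pandas as pd

aa = pd.DataFrame({
    'user': ['F', 'T', 'T', 'T', 'B', 'K', 'J', 'T', 'J', 'B'],
    'time': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4],
})

grouped = aa.groupby('user')['time']

# Aggregation: each group collapses to one value,
# so the result is indexed by the group keys.
agg = grouped.sum()

# Non-aggregating method: each group keeps all of its rows,
# so the result is indexed like the original frame.
per_row = grouped.diff()

# Rough hand-rolled model (illustration only): diff each group
# separately, then put the pieces back in the original row order.
pieces = [grp.diff() for _, grp in grouped]
manual = pd.concat(pieces).sort_index()

print(agg.index.tolist())              # ['B', 'F', 'J', 'K', 'T']
print(per_row.index.equals(aa.index))  # True
print(manual.equals(per_row))
```

Because `per_row` carries the original row labels, assigning it with `aa['diff'] = ...` lines up row by row, which is why the result is not a list of unique users.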
