
I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output.

import pandas as pd
import numpy as np

df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
    np.arange(1, 11), np.ones(10)]).T

df.columns = ['id', 'stamp', 'increment']

df_result = pd.DataFrame([ np.array(['a','b','c','b','a']), 
        np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T

df_result.columns = ['id', 'stamp', 'increment_sum']

In [2]: df
Out[2]:
  id stamp increment
0  a     1         1
1  a     2         1
2  a     3         1
3  b     4         1
4  c     5         1
5  b     6         1
6  b     7         1
7  a     8         1
8  a     9         1
9  a    10         1

In [3]: df_result
Out[3]:
   id    stamp         increment_sum
0  a     1             3
1  b     4             1
2  c     5             1
3  b     6             2
4  a     8             3

I can accomplish this via

def get_result(d):
    total = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, total

# idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
# a new run starts whenever id differs from the previous row; cumsum gives each run its own key
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()

# list() so this also works on Python 3, where zip returns an iterator
result = list(zip(*df.groupby('key').apply(get_result)))

df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']

But I'm sure there must be a more elegant solution
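For what it's worth, with that same key column in place (i.e. df as it stands right after the cumsum line, before it gets reassigned), the zip/apply step could probably be collapsed into a single agg call; a minimal, untested sketch that uses only the column names above:

# aggregate each run in one pass: first id, first stamp, summed increment
result = (df.groupby('key')
            .agg({'id': 'first', 'stamp': 'first', 'increment': 'sum'})
            .rename(columns={'increment': 'increment_sum'})
            .reset_index(drop=True))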

Tom M

1 Answer


Not the most optimal code, but it solves the problem:

> df_group = df.groupby('id')

We can't use id alone for the groupby, so add another new column to group within id, based on whether the stamps are consecutive or not:

> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
  id stamp increment  group_diff
0  a     1         1           0
1  a     2         1           0
2  a     3         1           0
3  b     4         1           0
4  c     5         1           0
5  b     6         1           2
6  b     7         1           2
7  a     8         1           5
8  a     9         1           5
9  a    10         1           5

Now we can use the new column group_diff for secondary grouping. A sort on stamp is added at the end, as suggested in the comments, to match the expected output exactly:

> df.groupby(['id', 'group_diff']).agg({'increment': 'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp', 'increment']].sort_values('stamp')

  id  stamp  increment
0  a      1          3
2  b      4          1
4  c      5          1
3  b      6          2
1  a      8          3
Kathirmani Sukumar
  • Output doesn't match the expected output. I think you need to sort it by 'stamp' – Alexander May 18 '15 at 18:17
  • 1
    Look like that is working, but FWIW, you could simplify with the trick here http://stackoverflow.com/questions/29142487/calculating-the-number-of-specific-consecutive-equal-values-in-a-vectorized-way/29143354#29143354 which is basically: `(df['id'] != df['id'].shift()).cumsum()` Basically saves you from lamda and type conversion. – JohnE May 18 '15 at 18:23
  • How to do it if there is no column 'stamp'? – Harper Koo May 02 '17 at 07:17
  • And it also has a bug. If the input is df = pd.DataFrame([np.array(['a','a','b','b','a','a','b','b','a','a']), np.arange(1, 11), np.ones(10)]).T you cannot separate the non-consecutive a and b. – Harper Koo May 02 '17 at 08:02
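A small sketch of the `(df['id'] != df['id'].shift()).cumsum()` trick from JohnE's comment, run against the input from the last comment where id alternates between a and b; the run key keeps every run separate even though the same letters repeat (only checked against that sample):

import pandas as pd
import numpy as np

# same input as in the last comment
df = pd.DataFrame([np.array(['a','a','b','b','a','a','b','b','a','a']),
    np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']

# every change of id starts a new run; the cumulative sum numbers the runs 1, 2, 3, ...
run_key = (df['id'] != df['id'].shift()).cumsum()
print(run_key.tolist())   # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

The key needs no stamp column at all and can be passed straight to groupby, together with an agg like the one used in the answer.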