
I want to take a cumulative count by user until the value in the events column changes. I am not quite able to figure out how to do it.

Data: (You can assume that the data will be in chronological order)

user   events 
   a   no_event
   a   no_event
   a   aaa
   a   no_event
   a   bbb
   b   asdf
   b   no_event
   b   ghtu

Desired Output:

user   events    RN
   a   no_event  1
   a   no_event  2
   a   aaa       3
   a   no_event  1
   a   bbb       2
   b   asdf      1
   b   no_event  1
   b   ghtu      2

I appreciate your help. Also, is there a tutorial or document where I can learn the different kinds of conditional group-by operations, especially on time series data?
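For reproducibility, the sample frame can be rebuilt with something like this (a quick sketch, assuming both columns are plain strings):

import pandas as pd

# sample data from the question, in the chronological order shown above
df = pd.DataFrame({'user':   ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'events': ['no_event', 'no_event', 'aaa', 'no_event', 'bbb',
                              'asdf', 'no_event', 'ghtu']})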

  • check second answer `df1 = df.groupby(['user', 'events'], sort=False).size().reset_index(name='event_count')` – jezrael Sep 18 '19 at 08:51
  • Hey @jezrael - I have seen the solution at the link you provided, but I don't think that is what I need. I have edited the question as I might not have framed it correctly. Can you please re-open the question or try to answer it? – Krishnang K Dalal Sep 18 '19 at 09:47

1 Answer


Create a helper Series and pass it to GroupBy.cumcount for the counter:

# mark rows where the event is 'no_event'
m = df['events'].eq('no_event')
# start a new group at the beginning of each consecutive run of 'no_event' rows
g = (m.ne(m.shift()) & m).cumsum()
# count rows within each (user, group) combination, starting at 1
df['RN2'] = df.groupby(['user', g]).cumcount().add(1)
print (df)
  user    events  RN2
0    a  no_event    1
1    a  no_event    2
2    a       aaa    3
3    a  no_event    1
4    a       bbb    2
5    b      asdf    1
6    b  no_event    1
7    b      ghtu    2

Explanation:

  1. First compare the events column with no_event by eq
  2. Then shift the values
  3. Compare for not equal with ne
  4. Chain with the mask m so only no_event rows can start a new group
  5. Create consecutive group ids by cumulative sum

All intermediate steps side by side:
print (df.assign(mask = df['events'].eq('no_event'),
                 shifted = m.shift(),
                 not_q = m.ne(m.shift()),
                 chained = (m.ne(m.shift()) & m),
                 consecut_gr = (m.ne(m.shift()) & m).cumsum()))

  user    events   mask shifted  not_q  chained  consecut_gr
0    a  no_event   True     NaN   True     True            1
1    a  no_event   True    True  False    False            1
2    a       aaa  False    True   True    False            1
3    a  no_event   True   False   True     True            2
4    a       bbb  False    True   True    False            2
5    b      asdf  False   False  False    False            2
6    b  no_event   True   False   True     True            3
7    b      ghtu  False    True   True    False            3

I am still not 100% sure whether the shift needs to be done per group; it mainly depends on the data:

m = df['events'].eq('no_event')
g = (m.ne(m.groupby(df['user']).shift()) & m).cumsum()

df['RN'] = df.groupby(['user', g]).cumcount().add(1)
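Why they coincide here: the final groupby key also contains user, so the counter restarts at a user boundary anyway. A small sketch of the boundary case (df2 is a hypothetical frame where the last row of user a and the first row of user b are both no_event) - the helper keys g differ, but the resulting counters are identical:

import pandas as pd

# hypothetical boundary case: user 'a' ends and user 'b' starts with 'no_event'
df2 = pd.DataFrame({'user':   ['a', 'a', 'b', 'b'],
                    'events': ['x', 'no_event', 'no_event', 'y']})

m = df2['events'].eq('no_event')
g_plain    = (m.ne(m.shift()) & m).cumsum()                       # 0, 1, 1, 1
g_per_user = (m.ne(m.groupby(df2['user']).shift()) & m).cumsum()  # 0, 1, 2, 2

# the helper keys differ at the a/b boundary ...
print (g_plain.tolist(), g_per_user.tolist())

# ... but the final counter is the same, because 'user' is part of the groupby key
rn_plain    = df2.groupby(['user', g_plain]).cumcount().add(1)
rn_per_user = df2.groupby(['user', g_per_user]).cumcount().add(1)
print (rn_plain.equals(rn_per_user))   # True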

EDIT1: It is the same:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000

L = ['no_event','a','s']
df = pd.DataFrame({'user': np.random.randint(100, size=N),
                   'events':np.random.choice(L,size=N)}).sort_values('user')

m = df['events'].eq('no_event')
g = (m.ne(m.shift()) & m).cumsum()
df['RN1'] = df.groupby(['user', g]).cumcount().add(1)

m = df['events'].eq('no_event')
g = (m.ne(m.groupby(df['user']).shift()) & m).cumsum()

df['RN2'] = df.groupby(['user', g]).cumcount().add(1)

print (df['RN2'].equals(df['RN1']))
True
  • Thank you for the solution. This is what I needed. I'm not able to understand what's happening with `g = (m.ne(m.groupby(df['user']).shift()) & m).cumsum()`. Can you please explain? Also, is there a simpler solution than this? I'll accept this answer if not. – Krishnang K Dalal Sep 18 '19 at 10:34
  • @KrishnangKDalal - Working on an explanation – jezrael Sep 18 '19 at 10:35
  • 1
    @KrishnangKDalal - I have problem create data for simulate, if is possible simplify answer like main solution or is necessary solution in last paragraph. Can you test in real data if same output? – jezrael Sep 18 '19 at 10:52
  • 1
    Thanks a lot for your answers. Your explanation is clear enough. I'm accepting your answer. Btw, thank you for introducing me to `pd.Series.ne` – Krishnang K Dalal Sep 18 '19 at 13:01