5

Take the following dataframe:

import pandas as pd
df = pd.DataFrame({'group_name': ['A','A','A','B','B','B'],
                   'timestamp': [4,6,1000,5,8,100],
                   'condition': [True,True,False,True,False,True]})

I want to add two columns:

  1. The row's order within its group
  2. rolling sum of the condition column within each group

I know I can do it with a custom apply, but I'm wondering if anyone has any fun ideas? (Also this is slow when there are many groups.) Here's one solution:

def range_within_group(input_df):
    df_to_return = input_df.copy()
    df_to_return = df_to_return.sort('timestamp')
    df_to_return['order_within_group'] = range(len(df_to_return))
    df_to_return['rolling_sum_of_condition'] = df_to_return.condition.cumsum()
    return df_to_return

df.groupby('group_name').apply(range_within_group).reset_index(drop=True)
exp1orer
  • 11,481
  • 7
  • 38
  • 51
  • 2
    Could also use `rank()`to get the order (aka rank) based on the timestamp. Also gives you a way to deal with ties, if that is needed. – JohnE Jun 09 '15 at 23:28

1 Answers1

7

GroupBy.cumcount does:

Number each item in each group from 0 to the length of that group - 1.

so simply:

>>> gr = df.sort('timestamp').groupby('group_name')
>>> df['order_within_group'] = gr.cumcount()
>>> df['rolling_sum_of_condition'] = gr['condition'].cumsum()

On pandas >= 0.2
df.sort() is not valid anymore, you have to use df.sort_values()

>>> gr = df.sort_values('timestamp').groupby('group_name')
>>> df['order_within_group'] = gr.cumcount()
>>> df['rolling_sum_of_condition'] = gr['condition'].cumsum()
el Josso
  • 159
  • 2
  • 16
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • But according to which order? Why would it order it specifically in the order of timestamps? – Ami Tavory Jun 09 '15 at 23:15
  • @AmiTavory added `.sort` before `groupby`. happy now? – behzad.nouri Jun 09 '15 at 23:18
  • Even assuming that it's guaranteed that ``groupby`` is stable (that is, preserves order), given that sorting is *Theta(n log(n))* (that is, superlinear), it is more efficient to sort each group separately. – Ami Tavory Jun 09 '15 at 23:20
  • @AmiTavory `groupby` is stable, otherwise all the `cum***` functions such as `cumsum`, `cumcount` will be useless. – behzad.nouri Jun 09 '15 at 23:33
  • FWIW, I removed the downvote, because following your correction, it now is correct. I still think that the complexity could be made lower - for superlinear functions, application in part works better than whole. All the best. – Ami Tavory Jun 09 '15 at 23:35
  • 2
    @AmiTavory all of this soln is completely vectorized or in cython. Using an apply (even with a fast lambda) will be orders of magnitude slower on any real dataset. As the apply is essentially a python loop. – Jeff Jun 10 '15 at 00:48
  • @Jeff *that* is actually a good point; notwithstanding, I'm really not sure about your statement concerning a "real dataset". As this solution is *Theta(c n log(n))*, where *n, c* are the rows, cols, respectively, and JohnE's comment above is *Theta(\sum_g|g| log(\sum_g|g|))* (*g* being the groups), it at least depends on the width, height, and number of groups. Also, as fast in constants as pandas is, orders of growth eventually trump. – Ami Tavory Jun 10 '15 at 01:01
  • 2
    @AmiTavory so, I would actually say linearly related not only to the sorting algo (v.s. the amortized cost of the argsort), BUT, really to the number of groups. E.g. some quick tests on say 1M rows. with 1000 groups the times are about the same! but with 100000 groups, the sorting method is MUCH faster. So algo doesn't ALWAYS win when the cost of implementing the algo are higher or the constants are small on another algo that has only slightly worse complexity. – Jeff Jun 10 '15 at 01:18
  • @Jeff interesting! BTW, did not see the number of columns in your above numbers - I bet they would change the tipping point too. However, I agree with your point that there's a balance to be made between orders of growth and constants for real-life problems. I found your comments interesting. Thanks. – Ami Tavory Jun 10 '15 at 01:24
  • 1
    Note: `sort` needs to be replaced by `sort_values` for this to work with newer versions of pandas (0.20 and upwards). https://stackoverflow.com/a/44123892/3107708 – Johann Sep 03 '20 at 07:33