14

I have a data frame like this:

0        04:10  obj1
1        04:10  obj1
2        04:11  obj1
3        04:12  obj2
4        04:12  obj2
5        04:12  obj1
6        04:13  obj2

Wanted to get a cumulative count for all the objects like this:

idx      time   object   obj1_count   obj2_count 
0        04:10  obj1        1             0
1        04:10  obj1        2             0
2        04:11  obj1        3             0
3        04:12  obj2        3             1
4        04:12  obj2        3             2
5        04:12  obj1        4             2
6        04:13  obj2        4             3

Tried playing with cumsum but not sure that is the right way. Any suggestions?

cottontail
  • 10,268
  • 18
  • 50
  • 51
jincept
  • 279
  • 2
  • 5
  • 16

4 Answers4

24

There is a special function for such operation: cumcount

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
 dtype: int64
Adarsh Madrecha
  • 6,364
  • 11
  • 69
  • 117
Alex Glinsky
  • 3,346
  • 1
  • 17
  • 14
11

You can just compare the column against the value of interest and call cumsum:

In [12]:
df['obj1_count'] = (df['object'] == 'obj1').cumsum()
df['obj2_count'] = (df['object'] == 'obj2').cumsum()
df

Out[12]:
      time object  obj1_count  obj2_count
idx                                      
0    04:10   obj1           1           0
1    04:10   obj1           2           0
2    04:11   obj1           3           0
3    04:12   obj2           3           1
4    04:12   obj2           3           2
5    04:12   obj1           4           2
6    04:13   obj2           4           3

Here the comparison will produce a boolean series:

In [13]:
df['object'] == 'obj1'

Out[13]:
idx
0     True
1     True
2     True
3    False
4    False
5     True
6    False
Name: object, dtype: bool

when you call cumsum on the above the True values are converted to 1 and False to 0 and are summed cumulatively

EdChum
  • 376,765
  • 198
  • 813
  • 562
3

You can generalize this process by getting the cumsum of pd.get_dummies. This should work for an arbitrary number of objects you want to count, without needing to specify each one individually:

# Get the cumulative counts.
counts = pd.get_dummies(df['object']).cumsum()

# Rename the count columns as appropriate.
counts = counts.rename(columns=lambda col: col+'_count')

# Join the counts to the original df.
df = df.join(counts)

The resulting output:

    time object  obj1_count  obj2_count
0  04:10   obj1           1           0
1  04:10   obj1           2           0
2  04:11   obj1           3           0
3  04:12   obj2           3           1
4  04:12   obj2           3           2
5  04:12   obj1           4           2
6  04:13   obj2           4           3

You can omit the rename step if it's acceptable to use count as a prefix instead of a suffix, i.e. 'count_obj1' instead of 'obj1_count'. Simply use the prefix parameter of pd.get_dummies:

 counts = pd.get_dummies(df['object'], prefix='count').cumsum()
root
  • 32,715
  • 6
  • 74
  • 87
2

Here's a way using numpy

u, iv = np.unique(
    df.object.values,
    return_inverse=True
)

objcount = pd.DataFrame(
    (iv[:, None] == np.arange(len(u))).cumsum(0),
    df.index, u
)
pd.concat([df, objcount], axis=1)

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624