
I am trying to count the number of consecutive positive events for each column in a pandas dataframe. The solution provided by DSM in Counting consecutive positive value in Python array works well for a given series.

import pandas as pd

a = [0,1,0,1,1,0,0,0,1,1,0,1,0]
b = [0,0,0,0,1,1,0,1,1,1,0,0,0]

# DSM's consecutiveCount from the linked answer
consecutiveCount = lambda y: y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)

series = pd.Series(a)

consecutiveCount(series).values

array([0, 1, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0], dtype=int64)

However, when I apply this to a dataframe with several columns, I get the following error.

df = pd.DataFrame({'a':a, 'b':b})
consecutiveCount(df)

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

If I iterate through each column, it works, but it is very slow. Is there a vectorized way to process the entire dataframe at once?
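
For reference, the per-column version I have now is roughly this (a sketch, not the exact code; it uses consecutiveCount and df from above):

# roughly the column-by-column loop (works, but slow when there are many columns)
counts = pd.DataFrame({col: consecutiveCount(df[col]) for col in df.columns})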

Thanks!

Mike

3 Answers


Apply consecutiveCount just once to the unstacked series, then stack the result back into a data frame.

Using DSM's consecutiveCount, which I named c here for simplicity:

>>> c = lambda y: y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
>>> c(df.unstack()).unstack().T

    a   b
0   0   0
1   1   0
2   0   0
3   1   0
4   2   1
5   0   2
6   0   0
7   0   1
8   1   2
9   2   3
10  0   0
11  1   0
12  0   0
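
If the one-liner is hard to parse, here is roughly the same pipeline spelled out step by step (just an expanded sketch of the code above, using the question's example data):

import pandas as pd

a = [0,1,0,1,1,0,0,0,1,1,0,1,0]
b = [0,0,0,0,1,1,0,1,1,1,0,0,0]
df = pd.DataFrame({'a': a, 'b': b})

s = df.unstack()                              # one long Series indexed by (column, row)
grp = (s != s.shift()).cumsum()               # new group id every time the value changes
counts = s * (s.groupby(grp).cumcount() + 1)  # running count within each run, zeroed where s is 0
result = counts.unstack().T                   # back to the original (row, column) shape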

Timings

# df2 is (65, 40)
df2 = pd.concat([pd.concat([df]*20, axis=1)]*5).T.reset_index(drop=True).T.reset_index(drop=True)

%timeit c(df2.unstack()).unstack().T
5.54 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df2.apply(c)
82.5 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
rafaelc
    Wow! That really increased the speed! Fantastic and thanks for the help. Now I just need to understand why that is so much faster. – Mike Sep 18 '18 at 15:48

You can try the apply method. That might give you better results:

df.apply(consecutiveCount)
busybear
  • Thanks! I feel like an idiot for not trying that; I guess I have not adjusted well to thinking in pandas. This reduced the time from 171 ms to 164 ms. This function gets hammered in my algorithm and really slows things down. I wonder if there is a faster way to do the consecutive counting? – Mike Sep 18 '18 at 15:22

Adapted from @cs95's answer:

import numpy as np
import pandas as pd

a = pd.Series([-1, 2, 15, 3, 45, 5, 23, 0, 6, -4, -8, -5, 3,
               -9, -7, -36, -71, -2, 25, 47, -8])

def pos_neg_count(a):
    # label each run of same-signed values with a group id
    v = a.ge(0).ne(a.ge(0).shift()).cumsum()
    # length of each run, in order of appearance
    vals = v.groupby(v).count().values
    # runs alternate in sign, starting with the sign of the first element
    cols = ['pos', 'neg'] if a[0] >= 0 else ['neg', 'pos']
    try:
        result = pd.DataFrame(vals.reshape(-1, 2), columns=cols)
    except ValueError:
        # odd number of runs: pad with a zero-length run so the reshape works
        vals = np.insert(vals, len(vals), 0)
        result = pd.DataFrame(vals.reshape(-1, 2), columns=cols)
    return result

pos_neg_count(a)
#       neg pos
#   0     1   8
#   1     3   1
#   2     5   2
#   3     1   0
rahul-ahuja