
I'm trying to count consecutive up days in equity return data; so if a positive day is 1 and a negative is 0, a list y=[0,0,1,1,1,0,0,1,0,1,1] should return z=[0,0,1,2,3,0,0,1,0,1,2].

I've come to a solution that takes only a few lines of code, but is very slow:

import pandas
from functools import reduce  # reduce is no longer a builtin in Python 3

y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])

def f(x):
    return reduce(lambda a, b: (a + b) * b, x)

z = pandas.expanding_apply(y, f)

I'm guessing I'm looping through the whole list y too many times. Is there a nice Pythonic way of achieving what I want while only going through the data once? I could write the loop myself, but I'm wondering if there's a better way.

smci
alex314159
  • do you strictly want a pandas solution? – Padraic Cunningham Dec 23 '14 at 19:46
  • 1
    As to native Pyton vs pandas performance, pandas has optimized [`pd.Series.diff()`,`cumcount()`,`cumsum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) and so on. Faster than native Python, esp. slow iterative things like `reduce()` – smci Jul 12 '21 at 22:16

4 Answers

>>> y = pandas.Series([0,0,1,1,1,0,0,1,0,1,1])

The following may seem a little magical, but actually uses some common idioms: since pandas doesn't yet have nice native support for a contiguous groupby, you often find yourself needing something like this.

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64

Some explanation: first, we compare y against a shifted version of itself to find when the contiguous groups begin:

>>> y != y.shift()
0      True
1     False
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10    False
dtype: bool

Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:

>>> (y != y.shift()).cumsum()
0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     6
10    6
dtype: int32

We can use groupby and cumcount to get us an integer counting up in each group:

>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0     0
1     1
2     0
3     1
4     2
5     0
6     1
7     0
8     0
9     0
10    1
dtype: int64

Add one:

>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0     1
1     2
2     1
3     2
4     3
5     1
6     2
7     1
8     1
9     1
10    2
dtype: int64

And finally zero the values where we had zero to begin with:

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
dumbledad
DSM
  • 2
    @CodingOrange: it's the usual tradeoff between multiple passes vs. fast operations. Python iteration is slow, so when using long Series, the more of it you can push down to native `pandas` ops, the better off you'll be. Conversely, if the Series is as small as it is in the example, you'll waste more time in overhead. Exactly the same issue comes up when writing vectorized `numpy` code. – DSM Dec 23 '14 at 19:40
  • Very magical! A bit too complicated for me but very interesting, tx – alex314159 Dec 23 '14 at 21:23
  • This really did the trick. If you're like me and have a column in a df you'd start with something like `y = df.pctChange > 0` to get the boolean series. – fantabolous Oct 26 '16 at 13:11
  • 3
    **This should be the accepted answer.** Yes, very magical on first appearance - but excellent explaination. Makes perfect sense. Thanks! – S3DEV Sep 06 '19 at 07:40
  • Very hard to understand. – huang Mar 02 '21 at 15:21
  • Recent versions of pandas allow `y.diff().ne(0)` instead of `(y != y.shift())`. And you can groupby that, I don't understand the *"support for a contiguous groupby"* comment. – smci Jul 11 '21 at 21:00
  • Very cool. Can you pls show the solution in case the array is a panda's df column? – user14237226 Oct 26 '22 at 13:58
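Following up on the comments above: in recent pandas the group-boundary test can be written with `diff`, and the whole expression still fits on one line. A sketch of that variant (not part of the original answer):

```python
import pandas as pd

y = pd.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# y.diff().ne(0) marks the start of each contiguous run (the leading NaN
# compares unequal to 0, so the first row always opens a new group);
# it is equivalent to y != y.shift() in the answer above.
z = y * (y.groupby(y.diff().ne(0).cumsum()).cumcount() + 1)
```

For a DataFrame column, the same expression applies after converting to 0/1 first, e.g. `y = (df['ret'] > 0).astype(int)`.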

If something is clear, it is "Pythonic". Frankly, I cannot even make your original solution work, and if it does work, I'd be curious whether it is actually faster than a plain loop. Did you compare?

Now, since we've started discussing efficiency, here are some insights.

Loops in Python are inherently slow, no matter what you do. Of course, if you are using pandas, you are also using numpy underneath, with all its performance advantages; just don't throw them away by looping. And that's not to mention that Python lists take far more memory than you may think: potentially much more than 8 bytes * length, since every integer may be wrapped in a separate object, placed in its own area of memory, and pointed at from the list.

Vectorization provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wonder if there is some way to represent it using expressions such as A+B*C. If you can construct this function out of functions in LAPACK, then you could even potentially beat ordinary C++ code compiled with optimization.
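For what it's worth, such a loop-free expression does exist: treat the answer as a running total that is reset at every zero. A sketch in plain NumPy (`streaks` is a name invented here for illustration):

```python
import numpy as np

def streaks(y):
    """Length of the current run of 1s at each position, fully vectorized."""
    y = np.asarray(y)
    total = np.cumsum(y)
    # At each zero, record how much of the running total to cancel;
    # carrying that forward with a running maximum resets the count.
    reset = np.where(y == 0, total, 0)
    return total - np.maximum.accumulate(reset)
```

Each step (`cumsum`, `where`, `maximum.accumulate`) is a single pass in compiled code, so the whole computation is a handful of passes with no Python-level loop.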

You can also use one of the compiled approaches to speed-up your loops. See a solution with Numba on numpy arrays below. Another option is to use PyPy, though you probably can't properly combine it with pandas.

In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a = np.random.randint(2, size=1000000)
In [144]: L = list(a)    # plain-Python copy for the baseline

# Try the simple approach
In [147]: def simple(L):
              for i in range(1, len(L)):
                  if L[i] == 1:
                      L[i] += L[i-1]

In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms


# Just-in-time compilation
In [149]: from numba import jit

@jit
def faster(z):
    prev = 0
    for i in range(len(z)):
        cur = z[i]
        if cur == 0:
            prev = 0
        else:
            prev = prev + cur
            z[i] = prev

In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms


In [159]: list(L)==list(a)
Out[159]: True

In fact, most of the time in the Numba run above was spent on just-in-time compilation. Time it again on fresh copies to see the compiled speed (remember to copy, as the function changes the array in place):

b = a.copy(); c = a.copy()   # two fresh copies, since faster() mutates its argument
In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms

In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 µs, total: 10.9 ms
Wall time: 10.9 ms

So for subsequent calls we get roughly a 25x speedup over the simple version. I suggest you read High Performance Python if you want to know more.

Sergey Orshanskiy

A similar approach to the answer from @DSM, with fewer steps:

import pandas as pd

s = pd.Series([0,0,1,1,1,0,0,1,0,1,1])
s.groupby(s.ne(s.shift()).cumsum()).cumsum()

Output:

0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
Mykola Zotko

Keeping things simple, using one array, one loop, and one conditional.

a = [0,0,1,1,1,0,0,1,0,1,1]

for i in range(1, len(a)):
    if a[i] == 1:
        a[i] += a[i - 1]
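Incidentally, the one-pass `reduce` the question was reaching for works as a one-liner with `itertools.accumulate`, which keeps every intermediate value instead of only the final one (a sketch, not from the original answer):

```python
from itertools import accumulate

y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

# (acc + v) * v extends the streak on a 1 and resets it to 0 on a 0.
z = list(accumulate(y, lambda acc, v: (acc + v) * v))
print(z)  # [0, 0, 1, 2, 3, 0, 0, 1, 0, 1, 2]
```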
Dan