I have a DataFrame like this:

timestamp             variance
2017-07-10 20:42:42   0
2017-07-10 20:42:42   1
2017-07-10 20:42:42   2
2017-07-10 20:42:43   6
2017-07-10 20:42:43   7
2017-07-10 20:42:43   9
2017-07-10 20:42:43   3
2017-07-10 20:42:43   4
2017-07-10 20:42:43   5
2017-07-10 20:42:43   1
2017-07-10 20:42:43   4
2017-07-10 20:42:43   1
2017-07-10 20:42:43   3
2017-07-10 20:42:43   7
2017-07-10 20:42:43   9

I would like to add a new column that increments for each row in which variance is greater than or equal to 5. When the value drops below 5, the count should decrement instead, but it should never go below 0.

This is what it should look like:

timestamp             variance  cumvar
2017-07-10 20:42:42   0         0
2017-07-10 20:42:42   1         0
2017-07-10 20:42:42   2         0
2017-07-10 20:42:43   6         1
2017-07-10 20:42:43   7         2
2017-07-10 20:42:43   9         3
2017-07-10 20:42:43   3         2
2017-07-10 20:42:43   4         1
2017-07-10 20:42:43   5         2
2017-07-10 20:42:43   1         1
2017-07-10 20:42:43   4         0
2017-07-10 20:42:43   1         0
2017-07-10 20:42:43   3         0
2017-07-10 20:42:43   7         1
2017-07-10 20:42:43   9         2

The closest I've come is this:

df['cumvar'] = np.where(df['variance'] >= 5, 1, -1).cumsum()

But of course, this doesn't apply a minimum value of 0 to the cumulative sum. How can I adapt this to achieve the above?

Alex Spurling
  • May be possible to do recursively with `scipy.signal.lfilter`, see posts [here](https://stackoverflow.com/questions/26267809/recursive-definitions-in-pandas) and [here](https://stackoverflow.com/questions/21336794/python-recursive-vectorization-with-timeseries). – Brad Solomon Jul 11 '17 at 14:52

3 Answers


One-liner:

pd.expanding_apply(df['variance'],
                   lambda s: reduce(lambda x, y: max(x + (1 if y >= 5 else -1), 0), s, 0))

But of course, readability sucks =)

You can also do it the way you started:

pd.expanding_apply(np.where(df['variance'] >= 5, 1, -1),
                   lambda s: reduce(lambda x, y: max(x + y, 0), s, 0))

You can improve readability by extracting the reduce function:

def tricky_func(acc, y):
    # +1 when the variance is at least 5, otherwise -1; never drop below 0
    next_value = 1 if y >= 5 else -1
    return max(acc + next_value, 0)

pd.expanding_apply(df['variance'], lambda s: reduce(tricky_func, s, 0))

Edit:

You need to import reduce from functools first if you are using Python 3.

And if you are using pandas 0.18+, you should use the

df['variance'].expanding().apply(lambda s: reduce(tricky_func, s, 0))

notation (thanks to Brad Solomon).
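
For completeness, here is a self-contained sketch with the newer API (the sample data is rebuilt from the question; note that expanding().apply() returns floats, hence the final cast):

from functools import reduce

import pandas as pd

def tricky_func(acc, y):
    # +1 when the variance is at least 5, otherwise -1; never drop below 0
    next_value = 1 if y >= 5 else -1
    return max(acc + next_value, 0)

df = pd.DataFrame({'variance': [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})

# Re-run the reduction over each growing prefix of the column; the final
# accumulator value is the cumvar for that row.
df['cumvar'] = df['variance'].expanding().apply(
    lambda s: reduce(tricky_func, s, 0)).astype(int)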

Alex Pertsev
  • This is a good answer, thank you. (P.S.: I was thinking of the `reduce` function in the same way.) +1 – BENY Jul 11 '17 at 15:03
  • Nice answer, might want to specify `from functools import reduce` for 3.x, and also `expanding_apply` is deprecated in favor of `.expanding.apply`. ([New API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.window.Expanding.apply.html).) – Brad Solomon Jul 11 '17 at 15:04

It's probably not the most elegant way to do it, but it works:

def cum_sum_limited(val, threshold=5, min_sum=0):
    global tot
    # Subtracting -1 adds 1: count up at or above the threshold, down below it
    tot -= 1 if val < threshold else -1
    # Clamp the running total so it never drops below min_sum
    tot = min_sum if tot < min_sum else tot
    return tot

tot = 0
df['cumvar'] = df['variance'].apply(cum_sum_limited)

Let me know what you think
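
If you would rather avoid the module-level global, the same idea works with a closure (make_counter is just an illustrative name, not part of the answer above):

def make_counter(threshold=5, min_sum=0):
    tot = 0  # the running total lives in the closure, not the module namespace
    def counter(val):
        nonlocal tot
        tot = max(tot + (1 if val >= threshold else -1), min_sum)
        return tot
    return counter

# Each make_counter() call starts a fresh counter, so re-running is safe
df['cumvar'] = df['variance'].apply(make_counter())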

Jan Zeiseweis

I would try a different approach: iterate over df['variance'].values to build a list, then attach it to the DataFrame as a new column:

x = 0
l = []
for val in df['variance'].values:
    # Count up at 5 or above, down below 5, never dropping under 0
    x = max(x + 1 if val >= 5 else x - 1, 0)
    l.append(x)

s = pd.Series(l, index=df.index, name='cumvar')
df = pd.concat([df, s], axis=1)
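
The same loop can also be expressed with itertools.accumulate, if you prefer not to manage the counter by hand (this is an alternative sketch, not this answer's original code; the initial keyword needs Python 3.8+):

from itertools import accumulate

# +1 at 5 or above, -1 below; clamp every partial sum at 0
steps = [1 if v >= 5 else -1 for v in df['variance']]
df['cumvar'] = list(accumulate(steps,
                               lambda acc, step: max(acc + step, 0),
                               initial=0))[1:]  # drop the seed value
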
Thomas Dussaut