1

The following code creates a random dataframe with values -1, 0 or 1:

df = pd.DataFrame(np.random.randint(-1,2,size=(100, 1)), columns=['val'])

print(df['val'].value_counts())

Let's see what it contains:

-1    36
 0    35
 1    29
Name: val, dtype: int64

Then, I'm trying to create a new column called mysum with a cumulative conditional sum which following the next rules:

  • If val = 1 and mysum >= 0, then mysum = mysum + 1.
  • If val = 1 and mysum < 0, then mysum = mysum + 2.

  • If val = -1 and mysum <= 0, then mysum = mysum - 1.

  • If val = -1 and mysum > 0, then mysum = mysum - 2

  • If val = 0 and mysum < 0, then mysum = mysum + 1.

  • If val = 0 and mysum > 0, then mysum = mysum - 1.

  • If val = 0 and mysum = 0, then mysum = mysum.

So I'm afraid it is not as simple as:

df['mysum'] = df['val'].cumsum()

So I tried the following:

df['mysum'] = 0

df['mysum'] = np.where((df['val'] == 1) & (df['mysum'].cumsum() >= 0), (df['mysum'].cumsum() + 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == 1) & (df['mysum'].cumsum() < 0), (df['mysum'].cumsum() + 2), df['mysum'].cumsum())

df['mysum'] = np.where((df['val'] == -1) & (df['mysum'].cumsum() <= 0), (df['mysum'].cumsum() - 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == -1) & (df['mysum'].cumsum() > 0), (df['mysum'].cumsum() - 2), df['mysum'].cumsum())

df['mysum'] = np.where((df['val'] == 0) & (df['mysum'].cumsum() > 0), (df['mysum'].cumsum() - 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == 0) & (df['mysum'].cumsum() < 0), (df['mysum'].cumsum() + 1), df['mysum'].cumsum())


print(df['mysum'].value_counts())
print(df)

But the column mysum is not accumulating!

Here is a fiddle where you can try: https://repl.it/FaXZ/8

mllamazares
  • 7,876
  • 17
  • 61
  • 89
  • Whats happens to mysum when both mysum and current val is 0? – Karl Anka Mar 15 '17 at 09:54
  • I didn't realized! I've added this casuistic as well! Thank you – mllamazares Mar 15 '17 at 09:58
  • (If so posting a looping solution would not be advisable...) – ntg Mar 15 '17 at 10:10
  • @ntg how would one go about to do it in a non looping manner? – Karl Anka Mar 15 '17 at 10:12
  • Please, check the updated question, it's an approach without looping (but it does not accumulating...) – mllamazares Mar 15 '17 at 10:14
  • 1
    Ok, I meant the obvious loop. Checking the code for cumsum, I see that it is added in generic.py from : cls.cumsum = _make_cum_function(cls, 'cumsum', name, name2, axis_descr, "cumulative sum",lambda y, axis: y.cumsum(axis), 0., np.nan), similar to cummmin and so on. There migh be some tweeks that would end up with pandas.Datafame being enriched with specialcumsum by providing the right function there... – ntg Mar 15 '17 at 10:21
  • I am not sure if your system can be modeled without loops, maybe it can may be not depending on the rules. Notice that some rules can have the property that a small change in a critical part of the input might lead to huge changes in the output. (If this can happen all the time, you get a chaotic system...) – ntg Mar 15 '17 at 10:28
  • @harrison4 as the cumulative sum depends on both myval and current mysum i cant think of how it could be done without looping. If you manage to solve it please post your solution and tag me! – Karl Anka Mar 15 '17 at 10:36

2 Answers2

1

Maybe there exists a more condensed solution, but you can loop through the dataframe and set values according to your conditions.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])

df['mysum'] = 0

for index, row in df.iterrows():

    # get the current value of mysum = mysum one row above current index
    mysum = df.get_value(index - 1, 1, takeable=True)

    # mysum at beginning is 0
    if index == 0:
        mysum = 0

    # set values at current index according to conditions
    if row[0] == 0 and mysum < 0:
        df.set_value(index, 1, mysum + 1, takeable=True)
    if row[0] == 1 and mysum < 0:
        df.set_value(index, 1, mysum + 2, takeable=True)
    if row[0] == -1 and mysum <= 0:
        df.set_value(index, 1, mysum - 1, takeable=True)
    if row[0] == 0 and mysum > 0:
        df.set_value(index, 1, mysum - 1, takeable=True)
    if row[0] == -1 and mysum > 0:
        df.set_value(index, 1, mysum - 2, takeable=True)
    if row[0] == 1 and mysum >= 0:
        df.set_value(index, 1, mysum + 1, takeable=True)
    if row[0] == 0 and mysum == 0:
        df.set_value(index, 1, mysum, takeable=True)

print df
Karl Anka
  • 2,529
  • 1
  • 19
  • 30
1

A much more efficient solution, see also generalized cumulative functions in NumPy/SciPy? :

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])
def my_sum(acc,x):
    if x == 0 and acc < 0:
        return acc + 1
    if x == 1 and acc < 0:
        return acc + 2
    if x == -1 and acc <= 0:
        return acc - 1
    if x == 0 and acc > 0:
        return acc - 1
    if x == -1 and acc > 0:
        return acc - 2
    if x == 1 and acc >= 0:
        return acc + 1
    if x == 0 and acc == 0:
        return acc
u_my_sum = np.frompyfunc(my_sum, 2, 1)
df['mysum'] = u_my_sum.accumulate(df.val, dtype=np.object).astype(np.int64)
print(df)
tillmo
  • 607
  • 5
  • 11