
I couldn't find an answer to this elsewhere, probably because I don't know what to call it. (English is not my native language.)

I have a large datetime data frame, where time matters. One column of the df takes the values [NaN, 1, -1]. I need a quick calculation that gives a cumulative sum which resets whenever the value changes.

Example:

    Time                 sign    desire_value
    2014-01-24 05:00:00  NaN     NaN
    2014-01-24 06:00:00  NaN     NaN
    2014-01-24 07:00:00  NaN     NaN
    2014-01-24 08:00:00  1       1
    2014-01-24 09:00:00  1       2
    2014-01-24 10:00:00  1       3
    2014-01-24 11:00:00  -1      1
    2014-01-24 12:00:00  -1      2
    2014-01-24 13:00:00  -1      3
    2014-01-24 14:00:00  -1      4
    2014-01-24 15:00:00  -1      5
    2014-01-24 16:00:00  1       1
    2014-01-24 17:00:00  1       2
    2014-01-24 18:00:00  1       3
    2014-01-24 19:00:00  -1      1
    2014-01-24 20:00:00  -1      2
    2014-01-24 21:00:00  1       1
    2014-01-24 22:00:00  1       2

I have a working solution that iterates row by row, but it is not very efficient.

    df['sign_1'] = df['sign'].shift(1)

    acc = 0
    for index, row in df.iterrows():
        if pd.isnull(row.sign):           # NaN rows get no count (NaN is not None)
            df.loc[index, 'desire_value'] = None
        elif row.sign == row.sign_1:      # same sign as the previous row: keep counting
            acc += 1
            df.loc[index, 'desire_value'] = acc
        else:                             # sign changed: restart the count
            acc = 1
            df.loc[index, 'desire_value'] = acc

I cannot find an array-based approach. I read that the most efficient way to iterate in Python is with Cython, but is there a more Pythonic way to solve this?

rpeczykowski

1 Answer


See the last section here.

This is an `itertools.groupby`-style operation:

    In [86]: v = df['value'].dropna()

The grouper flags the group breakpoints; taking the cumsum of those flags assigns a distinct label to each run:

    In [87]: grouper = (v != v.shift()).cumsum()

    In [88]: grouper
    Out[88]: 
    3     1
    4     1
    5     1
    6     2
    7     2
    8     2
    9     2
    10    2
    11    3
    12    3
    13    3
    14    4
    15    4
    16    5
    17    5
    Name: value, dtype: int64

Then just a simple cumsum:

    In [89]: df.groupby(grouper)['value'].cumsum()
    Out[89]: 
    0    NaN
    1    NaN
    2    NaN
    3      1
    4      2
    5      3
    6     -1
    7     -2
    8     -3
    9     -4
    10    -5
    11     1
    12     2
    13     3
    14    -1
    15    -2
    16     1
    17     2
    dtype: float64

You can certainly `.abs()` the above if you do in fact want the absolute values.
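
For reference, a minimal self-contained sketch of the whole approach with `.abs()` applied. The frame is rebuilt by hand here, and the column is named `sign` as in the question (the snippets above call it `value`):

    import numpy as np
    import pandas as pd

    # Rebuild the question's example frame: 18 hourly rows.
    times = pd.date_range('2014-01-24 05:00', periods=18, freq='h')
    sign = [np.nan] * 3 + [1] * 3 + [-1] * 5 + [1] * 3 + [-1] * 2 + [1] * 2
    df = pd.DataFrame({'sign': sign}, index=times)

    # Label each run of identical values, skipping the NaNs.
    v = df['sign'].dropna()
    grouper = (v != v.shift()).cumsum()

    # Signed cumulative sum per run; .abs() turns it into the 1, 2, 3, ...
    # counts of desire_value. Rows where sign is NaN stay NaN.
    df['desire_value'] = df.groupby(grouper)['sign'].cumsum().abs()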

Jeff
  • I think he's looking more for a `cumcount() + 1` (with the nans patched in) than a cumulative sum, despite the wording. – DSM Apr 02 '15 at 20:13
  • @DSM but doesn't just adding a call to `.abs()` give what the OP wants? – EdChum Apr 02 '15 at 20:14
  • @EdChum: huh. I suppose here it would! I hadn't thought through the consequences of the values only being restricted to nan, -1, and 1. – DSM Apr 02 '15 at 20:15
  • It's pretty trivial to go from what I show to using `.abs()`. I happen to think this is more useful, but the OP can decide. – Jeff Apr 02 '15 at 20:28
  • @Jeff, big thanks to you. It isn't very intuitive to use groupby as an itertool, but it is very efficient. This trick sped up processing a df with 65k rows from 45 s to 1.8 s. This part of the pandas documentation is not easy to study, but it is definitely worth it. – rpeczykowski Apr 10 '15 at 11:11
  • How can I use `df.groupby(grouper)['value'].cumsum()` to get the result as a percentage, so that the cumsum adds up to 1 for every grouper? I am looking for the cumulative percentage in that case. – PV8 Jun 27 '19 at 09:54
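
DSM's `cumcount() + 1` reading gives the same counts directly; a minimal sketch, reusing `v` and `grouper` from the answer:

    # 0-based position within each run, plus one; reindexing against the
    # full frame restores NaN for the rows that were dropped.
    counts = v.groupby(grouper).cumcount() + 1
    df['desire_value'] = counts.reindex(df.index)

And for PV8's follow-up, one possible sketch (assuming the same `v` and `grouper`) divides each group's running sum by the group total, so the cumulative share reaches 1 at the end of every group:

    # Cumulative share within each group: running sum over group total.
    cum_pct = v.groupby(grouper).cumsum() / v.groupby(grouper).transform('sum')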