Calculating the accumulated summation of clustered data in data frame in pandas

Question

Given the following data frame:

  index       value         
    1          0.8
    2          0.9
    3          1.0
    4          0.9
    5          nan
    6          nan
    7          nan
    8          0.4
    9          0.9
   10          nan
   11          0.8
   12          2.0
   13          1.4
   14          1.9
   15          nan
   16          nan
   17          nan
   18          8.4
   19          9.9
   20          10.0
   …

in which the data 'value' is separated into a number of clusters by value NAN. is there any way I can calculate some values such as accumulate summation, or mean of the clustered data, for example, I want calculate the accumulated sum and generate the following data frame:

  index       value        cumsum  
    1          0.8          0.8
    2          0.9          1.7
    3          1.0          2.7
    4          0.9          3.6
    5          nan          0
    6          nan          0
    7          nan          0
    8          0.4          0.4
    9          0.9          1.3
   10          nan          0
   11          0.8          0.8
   12          2.0          2.8
   13          1.4          4.2
   14          1.9          6.1
   15          nan          0
   16          nan          0
   17          nan          0
   18          8.4          8.4
   19          9.9          18.3
   20          10.0         28.3
   …

Any suggestions?

Also as a simple extension of the problem, if two clusters of data are close enough, such as there are only 1 NAN separate them we consider the as one cluster of data, such that we can have the following data frame:

  index       value        cumsum  
    1          0.8          0.8
    2          0.9          1.7
    3          1.0          2.7
    4          0.9          3.6
    5          nan          0
    6          nan          0
    7          nan          0
    8          0.4          0.4
    9          0.9          1.3
   10          nan          1.3
   11          0.8          2.1
   12          2.0          4.1
   13          1.4          5.5
   14          1.9          7.4
   15          nan          0
   16          nan          0
   17          nan          0
   18          8.4          8.4
   19          9.9          18.3
   20          10.0         28.3

Thank you for the help!

take a look at this http://stackoverflow.com/questions/18196811/cumsum-reset-at-nan — Bob Haffner, Apr 22 '15 at 03:51
the answer there are only valid for some very specific numbers. — user6396, Apr 22 '15 at 11:58

score 5 · Accepted Answer · answered Apr 22 '15 at 04:00

You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding out the parts of value that we want to treat as zero:

n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)

produces

>>> df
    index  value  cumsum  cumsum_skip1
0       1    0.8     0.8           0.8
1       2    0.9     1.7           1.7
2       3    1.0     2.7           2.7
3       4    0.9     3.6           3.6
4       5    NaN     0.0           0.0
5       6    NaN     0.0           0.0
6       7    NaN     0.0           0.0
7       8    0.4     0.4           0.4
8       9    0.9     1.3           1.3
9      10    NaN     0.0           1.3
10     11    0.8     0.8           2.1
11     12    2.0     2.8           4.1
12     13    1.4     4.2           5.5
13     14    1.9     6.1           7.4
14     15    NaN     0.0           0.0
15     16    NaN     0.0           0.0
16     17    NaN     0.0           0.0
17     18    8.4     8.4           8.4
18     19    9.9    18.3          18.3
19     20   10.0    28.3          28.3

When I do **to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)** It gave me a error message **TypeError: 'int' object is not callable** my python is 2.7.3 BTW. Thanks! — user6396, Apr 22 '15 at 12:27
Also, if I try to calculate mean or max of each cluster, **df["max"] = df["value"].groupby(clusters).max().fillna(0)** doesn't work properly. — user6396, Apr 22 '15 at 12:39
it is solved by using **cummax**. both mean and max won't lineup the index correctly. — user6396, Apr 22 '15 at 12:48
@user6396: it sounds like you should read the [groupby](http://pandas.pydata.org/pandas-docs/stable/groupby.html) section of the docs -- then you'll understand why your `max` and `mean` won't work. In particular, look at `transform`. — DSM, Apr 22 '15 at 14:09
thank you for the help. very new to pandas, really powerful tool! — user6396, Apr 22 '15 at 14:20

Calculating the accumulated summation of clustered data in data frame in pandas

1 Answers1