0

Given the following data frame:

  index       value         
    1          0.8
    2          0.9
    3          1.0
    4          0.9
    5          nan
    6          nan
    7          nan
    8          0.4
    9          0.9
   10          nan
   11          0.8
   12          2.0
   13          1.4
   14          1.9
   15          nan
   16          nan
   17          nan
   18          8.4
   19          9.9
   20          10.0
   …

in which the data 'value' is separated into a number of clusters by value NAN. is there any way I can calculate some values such as accumulate summation, or mean of the clustered data, for example, I want calculate the accumulated sum and generate the following data frame:

  index       value        cumsum  
    1          0.8          0.8
    2          0.9          1.7
    3          1.0          2.7
    4          0.9          3.6
    5          nan          0
    6          nan          0
    7          nan          0
    8          0.4          0.4
    9          0.9          1.3
   10          nan          0
   11          0.8          0.8
   12          2.0          2.8
   13          1.4          4.2
   14          1.9          6.1
   15          nan          0
   16          nan          0
   17          nan          0
   18          8.4          8.4
   19          9.9          18.3
   20          10.0         28.3
   …

Any suggestions?

Also as a simple extension of the problem, if two clusters of data are close enough, such as there are only 1 NAN separate them we consider the as one cluster of data, such that we can have the following data frame:

  index       value        cumsum  
    1          0.8          0.8
    2          0.9          1.7
    3          1.0          2.7
    4          0.9          3.6
    5          nan          0
    6          nan          0
    7          nan          0
    8          0.4          0.4
    9          0.9          1.3
   10          nan          1.3
   11          0.8          2.1
   12          2.0          4.1
   13          1.4          5.5
   14          1.9          7.4
   15          nan          0
   16          nan          0
   17          nan          0
   18          8.4          8.4
   19          9.9          18.3
   20          10.0         28.3

Thank you for the help!

user6396
  • 1,832
  • 6
  • 23
  • 38

1 Answers1

5

You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding out the parts of value that we want to treat as zero:

n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)

produces

>>> df
    index  value  cumsum  cumsum_skip1
0       1    0.8     0.8           0.8
1       2    0.9     1.7           1.7
2       3    1.0     2.7           2.7
3       4    0.9     3.6           3.6
4       5    NaN     0.0           0.0
5       6    NaN     0.0           0.0
6       7    NaN     0.0           0.0
7       8    0.4     0.4           0.4
8       9    0.9     1.3           1.3
9      10    NaN     0.0           1.3
10     11    0.8     0.8           2.1
11     12    2.0     2.8           4.1
12     13    1.4     4.2           5.5
13     14    1.9     6.1           7.4
14     15    NaN     0.0           0.0
15     16    NaN     0.0           0.0
16     17    NaN     0.0           0.0
17     18    8.4     8.4           8.4
18     19    9.9    18.3          18.3
19     20   10.0    28.3          28.3
DSM
  • 342,061
  • 65
  • 592
  • 494
  • When I do **to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)** It gave me a error message **TypeError: 'int' object is not callable** my python is 2.7.3 BTW. Thanks! – user6396 Apr 22 '15 at 12:27
  • Also, if I try to calculate mean or max of each cluster, **df["max"] = df["value"].groupby(clusters).max().fillna(0)** doesn't work properly. – user6396 Apr 22 '15 at 12:39
  • it is solved by using **cummax**. both mean and max won't lineup the index correctly. – user6396 Apr 22 '15 at 12:48
  • @user6396: it sounds like you should read the [groupby](http://pandas.pydata.org/pandas-docs/stable/groupby.html) section of the docs -- then you'll understand why your `max` and `mean` won't work. In particular, look at `transform`. – DSM Apr 22 '15 at 14:09
  • thank you for the help. very new to pandas, really powerful tool! – user6396 Apr 22 '15 at 14:20