Hopefully the title is explicit enough.

I have a table that looks like this:

classes id value
a       1  10
a       2  15
a       3  12
b       1  5
b       2  9
b       3  7
c       1  6
c       2  14
c       3  6

And here is what I would like:

classes id value cumsum
a       1  10    10
a       2  15    25
a       3  12    37
b       1  5     5
b       2  9     14
b       3  7     21
c       1  6     6
c       2  14    20
c       3  6     26

I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes:

id value cumsum
1  10    10
2  15    25
3  12    37

It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.

However, when I apply the exact same code to the dataset I'm working on now (which looks like the first table in this question, i.e. with multiple classes), without subsetting it by a, b, c, it takes ages: it has been running for 4 hours now, and the dataset is 40,000 rows.

Is there an issue with the code from the linked answer when it is used in this context? I have trouble wrapping my head around the triangular join, but I suspect the size of the join grows quickly as the number of rows increases, slowing the whole thing down a lot, and that this may be made even worse by having multiple "classes" over which to compute the cumulative sums.
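To put a rough number on that suspicion, here is a small sketch in base R. It assumes the linked solution is a triangular self-join of the usual form (each row joined to every row whose id is less than or equal to its own); the function name `triangular_pairs` and the three-way class split are hypothetical, chosen only for illustration.

```r
# Rows produced by a triangular self-join (b.id <= a.id) over n rows:
# row k matches k rows, so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
triangular_pairs <- function(n) n * (n + 1) / 2

triangular_pairs(3)      # the 3-row example above: 6 joined rows
triangular_pairs(40000)  # 40,000 rows treated as one group: 800,020,000 joined rows

# If the join condition also required matching classes, only within-class
# pairs would be built. Hypothetical split into 3 classes of similar size:
class_sizes <- c(13334, 13333, 13333)
sum(triangular_pairs(class_sizes))
```

The count grows quadratically with the number of rows, so a join that is not restricted to matching classes builds far more intermediate rows than one that is.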

Is there any way this could be done faster? I'm using SQL in R through the `sqldf` package. A solution in either R code (with or without a common external package) or SQL code will do.

Thanks

François M.

2 Answers

In SQL, you can do a cumulative sum using the ANSI standard sum() over () functionality:

select classes, id, value,
       sum(value) over (partition by classes order by id) as cumesum
from t;
Gordon Linoff
  • The `sqldf` package uses sqlite, so the above might not work. Relevant post: http://stackoverflow.com/questions/4074257/sqlite-equivalent-of-row-number-over-partition-by – zx8754 Feb 11 '16 at 13:15
  • sqldf uses sqlite *by default*, but sqldf can use PostgreSQL, in which case the above should work. – G. Grothendieck Feb 16 '16 at 00:48

Or you can use `by` from base R:

df$cumsum <- unlist(by(df$value, df$classes, cumsum))
#  classes id value cumsum
#1       a  1    10     10
#2       a  2    15     25
#3       a  3    12     37
#4       b  1     5      5
#5       b  2     9     14
#6       b  3     7     21
#7       c  1     6      6
#8       c  2    14     20
#9       c  3     6     26
mtoto
  • Actually it doesn't seem to be working (I might be doing something wrong). I'm not sure, but it seems to be doing the cumulative sum over the whole thing and not `by` each class. Here is what I do: `df[with(df, order(classes, another_value, decreasing = TRUE))]` and then `df$cumsum <- unlist(by(df$value, df$classes, cumsum))` – François M. May 18 '16 at 12:03
  • What is `another_value`? The code is working with your example dataset. – mtoto May 18 '16 at 12:07
  • `another_value` is `id`, for instance: a column on which I want to order. (NB: there is another problem, it doesn't seem to be doing the cumulative sum over the `value` column...) – François M. May 18 '16 at 12:12
  • You can ask a new question, provided you include example data and code. – mtoto May 18 '16 at 12:13
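
On the ordering problem raised in the comments above, a minimal sketch of one possible explanation and workaround (not part of the original answer): `unlist(by(...))` returns the per-class results concatenated in class order, so the assignment only lines up when the rows of `df` are already grouped by class in that same order. Also, as written, the `df[with(df, order(...))]` line in the comment lacks the row-index comma and does not assign its result. The sketch below uses base R's `ave()`, which returns values aligned with the original rows; the shuffled data is the question's example, rearranged here for illustration.

```r
# Example data from the question, deliberately shuffled so that rows
# are NOT grouped by class.
df <- data.frame(
  classes = c("b", "a", "c", "a", "b", "c", "a", "c", "b"),
  id      = c(2, 1, 3, 3, 1, 1, 2, 2, 3),
  value   = c(9, 10, 6, 12, 5, 6, 15, 14, 7)
)

# Sort by class, then id. Note the trailing comma (row indexing)
# and the assignment back to df.
df <- df[order(df$classes, df$id), ]

# ave() applies cumsum within each class and puts each result back
# at the position of the row it came from.
df$cumsum <- ave(df$value, df$classes, FUN = cumsum)
df
```

Because `ave()` keeps results aligned with their rows, the sort only determines the order in which values are accumulated within each class, not whether the result lands on the right row.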