add partial cum sum column to data frame

Question

I have a df sorted by product id prodID and by date Date. I need to add a column that shows a kind-of cumulative index of how many times each prodID is in the df. For example: if a proID appears only once, the index will be 1 on that row. If another prodID appears in 3 rows (which are consecutive in the df, for the df is sorted), then the index should be 1 in the first row for that very prodID, then 2, then 3 in the following rows. Basically I need that my initial df:

 initial.df <- structure(list(prodID = c("009hpOpzwl", "00An0zNeEQ", "00An0zNeEQ", "00An0zNeEQ", "00An0zNeEQ", "00DtU3Bk6O", "00DtU3Bk6O", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk"), Date = c("2012-06", "2014-09", "2014-09", "2014-09", "2014-09", "2001-11", "2001-11", "2002-11", "2002-12", "2003-01", "2003-02", "2003-03"), status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L), .Label = c("rare", "occasional", "amateur", "connoisseur", "expert", "fool"), class = "factor"),     rating = c(2.5, 4.7, 4.7, 4.7, 4.7, 4.4, 4.4, 3.5, 3.83,     3.36, 3.53, 3.78), over = c(68, 49, 49, 49, 49, 22, 22, 29,     38.33, 43.3, 39.53, 30.58)), class = "data.frame", row.names = c(NA, -12L), .Names = c("prodID", "Date", "status", "rating", "over"))

becomes

new.df <- structure(list(prodID = c("009hpOpzwl", "00An0zNeEQ", "00An0zNeEQ", "00An0zNeEQ", "00An0zNeEQ", "00DtU3Bk6O", "00DtU3Bk6O", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk", "00FyjrH1kk"), Date = c("2012-06", "2014-09", "2014-09", "2014-09", "2014-09", "2001-11", "2001-11", "2002-11", "2002-12", "2003-01", "2003-02", "2003-03"), status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L), .Label = c("rare", "occasional", "amateur", "connoisseur", "expert", "fool"), class = "factor"),     rating = c(2.5, 4.7, 4.7, 4.7, 4.7, 4.4, 4.4, 3.5, 3.83,     3.36, 3.53, 3.78), over = c(68, 49, 49, 49, 49, 22, 22, 29,     38.33, 43.3, 39.53, 30.58), index = c(1, 1, 2, 3, 4, 1, 2,     1, 2, 3, 4, 5)), .Names = c("prodID", "Date", "status", "rating", "over", "index"), row.names = c(NA, -12L), class = "data.frame")

Thank you in advance for any suggestion

score 3 · Answer 1 · answered Jan 05 '15 at 10:04

3

You can use the ave function to acheive this, if the data are guaranteed to be sorted as you claim:

initial.df$index <- ave(initial.df$prodID, initial.df$prodID, FUN=function(x) seq(along=x))

answered Jan 05 '15 at 10:04

Gavin Kelly

2,374
1
10
13

works like a charm (I used `FUN=seq_along` ). Thank you – Sal Jan 05 '15 at 10:16

score 3 · Answer 2 · answered Jan 05 '15 at 10:28

3

Just for completeness, this is very straightforward operation when using data.table and will be both efficient, short syntax and creates the column by reference, simply:

library(data.table)
setDT(initial.df)[, index := seq_len(.N), prodID]

answered Jan 05 '15 at 10:28

David Arenburg

91,361
17
137
196

score 2 · Accepted Answer · answered Jan 05 '15 at 10:46

2

If this question is not being closed as a duplicate of some other and we already have the data.table answer, here's the dplyr edition:

library(dplyr)
df %>% group_by(prodID) %>% mutate(index = row_number())

answered Jan 05 '15 at 10:46

talat

68,970
21
126
157

score 1 · Answer 4 · answered Jan 05 '15 at 10:03

1

What about

do.call(rbind, lapply(split(initial.df, initial.df$prodID), function(x) cbind(x, 1:nrow(x))))

answered Jan 05 '15 at 10:03

johannes

14,043
5
40
51

add partial cum sum column to data frame

4 Answers4