1

I have a very large dataset that looks simplified like this:

row.    member_id   entry_id    comment_count   timestamp
1       1            a              4           2008-06-09 12:41:00
2       1            b              1           2008-07-14 18:41:00
3       1            c              3           2008-07-17 15:40:00
4       2            d              12          2008-06-09 12:41:00
5       2            e              50          2008-09-18 10:22:00
6       3            f              0           2008-10-03 13:36:00

I now want to create new a column, in which I sum up all the "commen_count" of previous ideas ("ID") from the same member. So I only want to sum up the comment_counts from entried that occured before the current entry. I can order my dataset by the member id and the timestamp.

The result should look something like this:

row.    member_id   entry_id    comments_count  timestamp             aggregated_count
1       1            a              4           2008-06-09 12:41:00        4
2       1            b              1           2008-07-14 18:41:00        5
3       1            c              3           2008-07-17 15:40:00        8
4       2            d              12          2008-06-09 12:41:00        12
5       2            e              50          2008-09-18 10:22:00        62
6       3            f              0           2008-10-03 13:36:00        0

Some idea how I can do this in R (or Stata)? I tried aggregate, but I don't understand how to sum only the comment_counts before the curren timestamp and only those with the current member_id.

Nikolas
  • 132
  • 1
  • 7
  • 1
    Provide a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data so that others can work on it and answer your question. Without sample data, it'll be difficult to help you. – talat Dec 16 '14 at 18:27
  • 4
    Please read [**about SO**](http://stackoverflow.com/tour): "Include details about what you have tried"; "Don't ask ... [q]uestions you haven't tried to find an answer for (show your work!)". Please provide a **minimal, self contained example**: see [**here**](http://stackoverflow.com/help/mcve), [**here**](http://www.sscce.org/), [**here**](http://adv-r.had.co.nz/Reproducibility.html), and [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Show us the code you have tried, where you are stuck, and the expected output. – Henrik Dec 16 '14 at 18:29
  • Hint: this should be trivial, but don't make other people do the work of imagining what the data are like. Show us. – Nick Cox Dec 16 '14 at 19:04
  • Thank you very much so far for the links. I tried to provide a good example and hope you can help me. – Nikolas Dec 16 '14 at 22:45

2 Answers2

2

Try this (assuming df is your data)

transform(df, aggregated_count = ave(comments_count, member_id, FUN = cumsum))
#   member_id entry_id comments_count           timestamp aggregated_count
# 1         1        a              4 2008-06-09 12:41:00                4
# 2         1        b              1 2008-07-14 18:41:00                5
# 3         1        c              3 2008-07-17 15:40:00                8
# 4         2        d             12 2008-06-09 12:41:00               12
# 5         2        e             50 2008-09-18 10:22:00               62
# 6         3        f              0 2008-10-03 13:36:00                0

Some additional ways (introduced for efficiency gain):

library(data.table)
setDT(df)[, aggregated_count := cumsum(comments_count), member_id]

Or

library(dplyr)
df %>%
  group_by(member_id) %>%
  mutate(aggregated_count = cumsum(comments_count))
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    This is what happens when you provide a reproducible example and desired output. You get correct answers in less than 5 minutes :) – David Arenburg Dec 16 '14 at 22:58
2

With Stata:

clear
set more off

*----- example data -----

input ///
row    member_id   str1 entry_id    comment_count   str30 timestamp
1       1            a              4           2008-06-09 12:41:00
2       1            b              1           2008-07-14 18:41:00
3       1            c              3           2008-07-17 15:40:00
4       2            d              12          2008-06-09 12:41:00
5       2            e              50          2008-09-18 10:22:00
6       3            f              0           2008-10-03 13:36:00
end

list

*----- what you want -----

bysort member_id: gen s = sum(comment_count)

list

This simply involves the use of the by: prefix.

Roberto Ferrer
  • 11,024
  • 1
  • 21
  • 23