Cumulate previous data in rows based on date and identifier

Question

With R studio I aim to accumulate data in one column (# of achievements) across different rows, subject to

Matching row: Only rows with identical person_id should be considered
Start date: Only rows prior to the start.date of the current row (i.e. only previous observations) should be considered
Position of interest: Only calculate for the positions of interest (1) to reduce computing effort

See rows 4 and 7 as examples:

Person_id	Position	Start.Date	Position of interest	# of achievements	CODE REQUIRED: # of previous achievements
Martin123	Student	01/2010	0	1	0
Martin123	Intern	01/2012	0	1	0
Martin123	Student	01/2014	0	2	0
Martin123	Employee	01/2018	1	2	4
Martin123	Employee	01/2020	0	2	0
Max456	Employee	01/2010	0	5	0
Max456	Employee	01/2012	1	1	5

The traditional approaches for conditional sums appear to be too static as I was unable to adjust them in a way for them to outputs sums per row while dynamically adjusting to the person_id (I have >100k rows in my dataset so cannot be done manually).

Edit: Language clarification

Please could you include the source data for the shown expected results? Also, please specify which RDBMS you're using; MySQL 5.x, MySQL8, PostgreSQL, SQL Server, SQLite, Oracle, etc, etc, all have different functionality and syntax. — MatBailie, Apr 16 '22 at 15:05
What does "position of interest" indicate? Why don't rows 2 or 3 have a non-zero value in the last column? What do you mean by "similar person_id"? Normally it would be "identical person_id"... — MatBailie, Apr 16 '22 at 15:09
@MatBailie Thanks for your message. Source data is all but the column marked. Currently its a simple csv file loaded into R. Excuse my English, its "identical" rather than "similar". Rows 2/3 should have a zero as the # of achievements should only be counted for the positions of interest (here: rows 4 and 7). Thanks — Alexanderg, Apr 16 '22 at 15:27
Row 7 is ***not*** marked as a position of interest. Please edit the question to; correct the typo, clarify the use of position of interest, add the tag for the R language. — MatBailie, Apr 16 '22 at 16:03
I don't use R, so I won't answer as the syntax will likely be slightly wrong. But you just need a cumulative sum grouped by the person_id, wrapped in an immediate if (if poi == 0 then 0 else cumulative sum of achievements). Perhaps refer to https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cumsum and https://stackoverflow.com/questions/16850207/calculate-cumulative-sum-cumsum-by-group — MatBailie, Apr 16 '22 at 16:06
Only calculating for position of interest won't reduce computing effort. Cumulative Sum needs to be calculated across all preceding rows any way. *(To calculate the 4th row R uses the cumulative sum from the 3rd row.)* So, perhaps keep it simpler and just run a cumulative sum grouped by person? — MatBailie, Apr 16 '22 at 16:39
Happy to. Currently trying to figure out the respective code for R. — Alexanderg, Apr 16 '22 at 16:42
The linked question (in my comment above) has multiple answers with different mechanisms for adding a cumulative sum to a dataframe. — MatBailie, Apr 16 '22 at 16:52

score 0 · Answer 1 · edited Apr 16 '22 at 22:22

0

You can try something like:

df <- df %>%
  group_by(Person_id, `Start.Date`) %>%  
  mutate(`prv # of achievements` = cumsum(`# of achievements`) %>%
  ungroup()

This should give you the cumulative sum of achievements by date, otherwise drop the date to get overall cumulative total (e.g. a sum) by the Person.id.

edited Apr 16 '22 at 22:22

Mossa

1,656
12
16

answered Apr 16 '22 at 17:29

user14836563

1
3

Cumulate previous data in rows based on date and identifier

1 Answers1