0

With R studio I aim to accumulate data in one column (# of achievements) across different rows, subject to

  • Matching row: Only rows with identical person_id should be considered
  • Start date: Only rows prior to the start.date of the current row (i.e. only previous observations) should be considered
  • Position of interest: Only calculate for the positions of interest (1) to reduce computing effort

See rows 4 and 7 as examples:

Person_id Position Start.Date Position of interest # of achievements CODE REQUIRED: # of previous achievements
Martin123 Student 01/2010 0 1 0
Martin123 Intern 01/2012 0 1 0
Martin123 Student 01/2014 0 2 0
Martin123 Employee 01/2018 1 2 4
Martin123 Employee 01/2020 0 2 0
Max456 Employee 01/2010 0 5 0
Max456 Employee 01/2012 1 1 5

The traditional approaches for conditional sums appear to be too static as I was unable to adjust them in a way for them to outputs sums per row while dynamically adjusting to the person_id (I have >100k rows in my dataset so cannot be done manually).

Edit: Language clarification

  • Please could you include the source data for the shown expected results? Also, please specify which RDBMS you're using; MySQL 5.x, MySQL8, PostgreSQL, SQL Server, SQLite, Oracle, etc, etc, all have different functionality and syntax. – MatBailie Apr 16 '22 at 15:05
  • What does "position of interest" indicate? Why don't rows 2 or 3 have a non-zero value in the last column? What do you mean by "similar person_id"? Normally it would be "identical person_id"... – MatBailie Apr 16 '22 at 15:09
  • @MatBailie Thanks for your message. Source data is all but the column marked. Currently its a simple csv file loaded into R. Excuse my English, its "identical" rather than "similar". Rows 2/3 should have a zero as the # of achievements should only be counted for the positions of interest (here: rows 4 and 7). Thanks – Alexanderg Apr 16 '22 at 15:27
  • Row 7 is ***not*** marked as a position of interest. Please edit the question to; correct the typo, clarify the use of position of interest, add the tag for the R language. – MatBailie Apr 16 '22 at 16:03
  • I don't use R, so I won't answer as the syntax will likely be slightly wrong. But you just need a cumulative sum grouped by the person_id, wrapped in an immediate if (if poi == 0 then 0 else cumulative sum of achievements). Perhaps refer to https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cumsum and https://stackoverflow.com/questions/16850207/calculate-cumulative-sum-cumsum-by-group – MatBailie Apr 16 '22 at 16:06
  • Only calculating for position of interest won't reduce computing effort. Cumulative Sum needs to be calculated across all preceding rows any way. *(To calculate the 4th row R uses the cumulative sum from the 3rd row.)* So, perhaps keep it simpler and just run a cumulative sum grouped by person? – MatBailie Apr 16 '22 at 16:39
  • Happy to. Currently trying to figure out the respective code for R. – Alexanderg Apr 16 '22 at 16:42
  • The linked question (in my comment above) has multiple answers with different mechanisms for adding a cumulative sum to a dataframe. – MatBailie Apr 16 '22 at 16:52

1 Answers1

0

You can try something like:

df <- df %>%
  group_by(Person_id, `Start.Date`) %>%  
  mutate(`prv # of achievements` = cumsum(`# of achievements`) %>%
  ungroup()

This should give you the cumulative sum of achievements by date, otherwise drop the date to get overall cumulative total (e.g. a sum) by the Person.id.

Mossa
  • 1,656
  • 12
  • 16