Use a single, common group-specific baseline for calculations (cumsum) within sub-groups

Question

I'm looking for a tidy solution preferably using tidyverse

This question is in line with this answer, it does however have an added twist. My data has an overall grouping variable 'grp'. Within each such group, I want to perform calculations based on cumulative sum (cumsum) within sub-groups defined by 'trial', here X and Y.

However, for the calculations within both sub-groups, trial "X" and trial "Y", I need to use a single, common group-specific baseline, i.e. where trial is B.

My desired outcome is Value3 in the data set desired_outcome below:

# library(tidyverse)
# library(dplyr)
desired_outcome # see below I got this `desired_outcome`
# A tibble: 10 x 6
# Groups:   grp [2]
   grp   trial    yr value1 value2 Value3
   <chr> <fct> <dbl>  <dbl>  <dbl>  <dbl>
 1 A     B      2021      2      0      2
 2 A     X      2022      3      1      5
 3 A     X      2023      4      2     10
 4 A     Y      2022      5      3      7
 5 A     Y      2023      6      4     16
 6 B     B      2021      0      2      0
 7 B     X      2022      1      3      3
 8 B     X      2023      2      4      8
 9 B     Y      2022      3      5      5
10 B     Y      2023      4      6     14

My minimal working example. Data first,

tabl <- tribble(~grp, ~trial, ~yr, ~value1, ~value2,
                'A', "B", 2021, 2, 0,
                'A', "X", 2022, 3, 1,
                'A', "X", 2023, 4, 2,
                'A', "Y", 2022, 5, 3,
                'A', "Y", 2023, 6, 4,
                'B', "B", 2021, 0, 2,
                'B', "X", 2022, 1, 3,
                'B', "X", 2023, 2, 4,
                'B', "Y", 2022, 3, 5,
                'B', "Y", 2023, 4, 6) %>% 
 mutate(trial = factor(trial, levels = c("B", "X", "Y"))) %>% 
  arrange(grp, trial, yr)

Now, I need to use group_by(), but I can't group on trial as I need to use the baseline, B in calculations for both "X" and "Y".

undesired_outcome_tidier_code <- tabl %>% 
  group_by(grp) %>% # this do not work!
  mutate(Value1.1 = cumsum(value1),
         Value2.1 = lag(cumsum(value2), default = 0),
         Value3   = Value1.1 + Value2.1) %>% 
         select(-Value1.1, -Value2.1)

In undesired_outcome_tidier_code row 4-5 and 9-10 is, for obvious reasons, not using line 1 and 6, respectively, as base line. As shown here,

undesired_outcome_tidier_code
# A tibble: 10 x 6
# Groups:   grp [2]
   grp   trial    yr value1 value2 Value3
   <chr> <fct> <dbl>  <dbl>  <dbl>  <dbl>
 1 A     B      2021      2      0      2
 2 A     X      2022      3      1      5
 3 A     X      2023      4      2     10
 4 A     Y      2022      5      3     17
 5 A     Y      2023      6      4     26
 6 B     B      2021      0      2      0
 7 B     X      2022      1      3      3
 8 B     X      2023      2      4      8
 9 B     Y      2022      3      5     15
10 B     Y      2023      4      6     24

I am looking for a solution that gets me desired_outcome (see below) in a tidy way.

I can, in this smaller example, work my way around it, to get to my desired_outcome, but it's a cumbersome two step solution. There must be a better/tidier way.

step1 <- tabl  %>% arrange(grp, trial, yr)  %>% filter(trial  != 'Y') %>% 
  group_by(grp) %>% 
  mutate(Value1.1 = cumsum(value1),
         Value2.1 = lag(cumsum(value2), default = 0),
         Value3   = Value1.1 + Value2.1)

step2 <- tabl  %>% arrange(grp, trial, yr)  %>% filter(trial  != 'X') %>% 
  group_by(grp) %>% 
  mutate(Value1.1 = cumsum(value1),
         Value2.1 = lag(cumsum(value2), default = 0),
         Value3   = Value1.1 + Value2.1)
    
desired_outcome <- rbind(step1, 
      step2 %>% filter(trial  != 'B') 
                         ) %>% select(-Value1.1, -Value2.1) %>% arrange(grp, trial, yr)

I tried a slightly more explicit title and expanded the explanation. I hope you don't mind :) Cheers — Henrik, Mar 16 '21 at 15:05

score 3 · Answer 1 · answered Mar 13 '21 at 13:42

With the addition of purrr, you could do:

map(.x = c("X", "Y"),
    ~ tabl %>%
     arrange(grp, trial, yr) %>%
     filter(trial != .x) %>%
     group_by(grp) %>% 
     mutate(value3 = cumsum(value1) + lag(cumsum(value2), default = 0))) %>% 
 reduce(full_join) %>%
 arrange(grp, trial, yr) 

  grp   trial    yr value1 value2 value3
   <chr> <fct> <dbl>  <dbl>  <dbl>  <dbl>
 1 A     B      2021      2      0      2
 2 A     X      2022      3      1      5
 3 A     X      2023      4      2     10
 4 A     Y      2022      5      3      7
 5 A     Y      2023      6      4     16
 6 B     B      2021      0      2      0
 7 B     X      2022      1      3      3
 8 B     X      2023      2      4      8
 9 B     Y      2022      3      5      5
10 B     Y      2023      4      6     14

Edo · Answer 2 · 2021-03-13T14:54:18.167

You can try with this.

calculate_value3 is a function that calculates value3 as you described. It does it for every letter of trial. It always includes the observation of the baseline. It doesn't matter if the letters will be different than X and Y. Note that baseline can be any letter you want, I set it up as "B" for now.
Inside the pipes, you go for a map-reduce solution. map will run the function calculate_value3 for each unique trial and reduce will set them all together with coalesce (which will replace all NAs --> this is why I initialize v3 as a vector of all NAs in calculate_value3)

calculate_value3 <- function(ut, # trial under examination
                             tr, # trial vector
                             v1, # value1 vector
                             v2, # value2 vector
                             baseline = "B"){ # baseline id
  
  v3      <- rep_len(NA, length(tr))
  ind     <- ut == tr | baseline == tr
  cumv1   <- cumsum(v1[ind]) 
  cumlv2  <- cumsum(lag(v2[ind], default = 0)) 
  v3[ind] <- cumv1 + cumlv2
  v3
  
}

library(purrr)
tabl %>% 
  group_by(grp) %>% 
  mutate(value3 = reduce(
    
    map(unique(trial), calculate_value3,
        tr = trial, v1 = value1, v2 = value2), 
    
    coalesce)) %>%
  ungroup()

#> # A tibble: 10 x 6
#>    grp   trial    yr value1 value2 value3
#>    <chr> <fct> <dbl>  <dbl>  <dbl>  <dbl>
#>  1 A     B      2021      2      0      2
#>  2 A     X      2022      3      1      5
#>  3 A     X      2023      4      2     10
#>  4 A     Y      2022      5      3      7
#>  5 A     Y      2023      6      4     16
#>  6 B     B      2021      0      2      0
#>  7 B     X      2022      1      3      3
#>  8 B     X      2023      2      4      8
#>  9 B     Y      2022      3      5      5
#> 10 B     Y      2023      4      6     14

The solution is flexible to the identifiers of the trials and seems reasonably easy to debug and to edit if need be [at least to me].

Henrik · Answer 3 · 2021-03-15T06:17:01.020

Because tidyverse didn't seem like a strict requirement, I take the opportunity to suggest a data.table alternative:

Starting with the 'desired_outcome' data, just to make it easier to compare results:

library(data.table)
setDT(desired_outcome)

desired_outcome[ , v3 := {
  c(value1[1], sapply(c("X", "Y"), function(g){
    .SD[trial %in% c("B", g), (cumsum(value1) + cumsum(shift(value2, fill = 0)))[-1]]
  }))}, by = grp]

#     grp trial   yr value1 value2 Value3 v3
#  1:   A     B 2021      2      0      2  2
#  2:   A     X 2022      3      1      5  5
#  3:   A     X 2023      4      2     10 10
#  4:   A     Y 2022      5      3      7  7
#  5:   A     Y 2023      6      4     16 16
#  6:   B     B 2021      0      2      0  0
#  7:   B     X 2022      1      3      3  3
#  8:   B     X 2023      2      4      8  8
#  9:   B     Y 2022      3      5      5  5
# 10:   B     Y 2023      4      6     14 14

For each 'grp' (by = grp), loop over 'trial' "X" and "Y" (sapply(c("X", "Y")). Within each sub-dataset defined by by (.SD), select rows where 'trial' is equal to "B" or the current value of the loop (trial %in% c("B", g)).

Do the calculation (cumsum(value1) + cumsum(shift(value2, fill = 0)) and remove the first value ([-1]). Append the first row within each 'grp', i.e. the row that corresponds to trial "B" (c(value1[1], ...). Assign the result to a new variable by reference (v3 := )

Use a single, common group-specific baseline for calculations (cumsum) within sub-groups

3 Answers3