1

I want to take cumulative sum of a column based on another column and the code below successfully does it. But additionally I need to exclude the current element.

library(data.table)

cat_var <- c("rock", "indie", "rock", "rock", "pop", "indie", "rock")
cat_var_2 <- c("blue", "green", "red", "red", "blue", "green", "blue")
target_var <- c(0, 0, 1, 1, 1, 1, 0)
df <- data.table("categorical_variable" = cat_var, "categorical_variable_2" = cat_var_2, "target_variable" =  target_var)

ave(df[,"target_variable"], df[,c("categorical_variable")], FUN=cumsum) 

For now I can take cumulative sum of the target_variable based on categorical_variable. I want to take cumulative sums of both categorical_variable and categorical_variable_2 in one piece of code that excludes the current value. Like this:

ave(df[,"target_variable"], df[,c("categorical_variable", "categorical_variable_2")], FUN=cumsum) 

Expected output is:

categorical_variable_transformed <- c(0, 0, 0, 1, 0, 0, 2)
categorical_variable_2_transformed <- c(0, 0, 0, 1, 0, 0, 1)
df$categorical_variable_transformed <- categorical_variable_transformed
df$categorical_variable_2_transformed <- categorical_variable_2_transformed
user438383
  • 5,716
  • 8
  • 28
  • 43
Mine
  • 831
  • 1
  • 8
  • 27
  • sorry, the logic is not clear to me – akrun Jan 11 '22 at 18:08
  • @akrun taking the cumulative sum of the target based on categorical columns separately without including the current element. This will generate a column for each of the categorical variables – Mine Jan 11 '22 at 18:14
  • @akrun By current element I mean if we are on the 4th row we are looking at the rows above the 4th row for cumulative sum – Mine Jan 11 '22 at 18:14
  • what about if the row is 3 or 2 i.e. do you take the all other rows for the 1st element i.e. 2, 3, 4 for cumsum – akrun Jan 11 '22 at 18:15
  • (Tangent: you're using `require(.)` incorrectly, see https://stackoverflow.com/a/51263513/3358272, https://yihui.org/en/2014/07/library-vs-require/, https://r-pkgs.org/namespace.html#search-path. In this case, use `library(.)`.) – r2evans Jan 11 '22 at 18:16
  • @akrun if the row is 1 the value is 0 because there are no rows above it. Moving from top to bottom – Mine Jan 11 '22 at 18:18
  • @r2evans thank you, let me check the link – Mine Jan 11 '22 at 18:18

3 Answers3

2

Try this:

library(data.table)
nms <- c("categorical_variable", "categorical_variable_2")
df[, paste0(nms, "_transformed") :=
       lapply(nms, \(g) ave(target_variable, get(g), FUN = cumsum) - target_variable)]
df
#    categorical_variable categorical_variable_2 target_variable categorical_variable_transformed categorical_variable_2_transformed
#                  <char>                 <char>           <num>                            <num>                              <num>
# 1:                 rock                   blue               0                                0                                  0
# 2:                indie                  green               0                                0                                  0
# 3:                 rock                    red               1                                0                                  0
# 4:                 rock                    red               1                                1                                  1
# 5:                  pop                   blue               1                                0                                  0
# 6:                indie                  green               1                                0                                  0
# 7:                 rock                   blue               0                                2                                  1
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Is it possible to get the transformed columns separtely as a dataframe – Mine Jan 11 '22 at 18:42
  • 1
    I don't think that's a good idea, but `df[, lapply(setNames(nm=nms), \(g) ave(target_variable, get(g), FUN = cumsum) - target_variable)]` – r2evans Jan 11 '22 at 18:55
  • 1
    I say "not good" because the aggregate values in absence of groups or context can now fall prey to asynchronous ordering problems: when separate, if `df` row-order ever changes, the row-order of this separate table no longer matches row-for-row, and can then be used incorrectly. – r2evans Jan 11 '22 at 18:57
  • 1
    @r2evans Can I ask what `\(g)` do? How does it differ to `function(g)` – Sweepy Dodo Jan 11 '22 at 19:21
  • 2
    @SweepyDodo It's just shorthand introduced in R-4.1. I believe it adds nothing for speed, just code-golf. – r2evans Jan 11 '22 at 19:57
  • How can I make this solution work for lower R versions? For example it works for 4.2.0 but does not work for 4.0.5 – Mine Jan 13 '22 at 19:24
  • Replace `\(g)` with `function(g)`. In that form, it is working for me in R-4.1.2, R-4.0.5, and R-3.6.1 (on win10). – r2evans Jan 13 '22 at 19:52
2

We may use data.table methods as it is a data.table

nm1 <- grep("categorical", names(df), value = TRUE)
nm2 <- paste0(nm1, "_transformed")
  
for(i in seq_along(nm1)) 
   df[, (nm2)[i] := cumsum(target_variable) - target_variable, by = c(nm1[i])]

-output

> df
   categorical_variable categorical_variable_2 target_variable categorical_variable_transformed categorical_variable_2_transformed
1:                 rock                   blue               0                                0                                  0
2:                indie                  green               0                                0                                  0
3:                 rock                    red               1                                0                                  0
4:                 rock                    red               1                                1                                  1
5:                  pop                   blue               1                                0                                  0
6:                indie                  green               1                                0                                  0
7:                 rock                   blue               0                                2                                  1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Yeah, this is likely the more canonical approach, I'm overly averse to `for` loops sometimes ... (*edit* though the most recent request for "separately" makes this `for` loop method less-ideal, unfortunately ... I still upvoted this, it's a good way to do it) – r2evans Jan 11 '22 at 18:54
  • 2
    @r2evans in some recursive cases, for loop is more easier to apply than Reduce or accumulate etc. – akrun Jan 11 '22 at 18:55
1

With .SD the problem seems easy to solve:

df[, target_variable := lapply(.SD, \(x) if(length(x) > 1L) sapply(seq_along(x), \(i) cumsum(x[-i])) else x),
   by = c("categorical_variable", "categorical_variable_2")]

df
#   categorical_variable categorical_variable_2 target_variable
#1:                 rock                   blue               0
#2:                indie                  green               0
#3:                 rock                    red               1
#4:                 rock                    red               1
#5:                  pop                   blue               1
#6:                indie                  green               1
#7:                 rock                   blue               0
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • I added an example output to the question. I was not clear enough, the output should produce a column for each categorical variable – Mine Jan 11 '22 at 18:25