
I'm trying to estimate some parameters across n factor levels in a data.table. While I'm familiar with using the by functionality to perform an operation by a factor, doing this for multiple sequential factor levels is causing me some problems.

As an example, take the simplified data set

df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))

Group Variable
1:     A     0.13    
2:     A     0.26
3:     B    -1.36
4:     B    -0.78
5:     B    -0.92
6:     C     0.00
7:     C    -2.49
8:     D    -1.85
9:     D     0.37
10:    D    -0.57
11:    D     1.42
12:    E    -0.72
13:    F    -1.04
14:    F     1.86
15:    F     0.49
16:    F     1.61

Using df[, mean(Variable), by = Group] would give the mean for each Group. However, I'd like to calculate the mean over the previous n Groups.
I've tried df[, zoo::rollapply(Variable, n, mean), by = Group]; however, because the Groups are of different sizes, a fixed window width n will not work.

What I would like is functionality akin to df[, mean(Variable), by = "This Group and previous n Groups"].
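Loosely, the behaviour I'm after could be written as a loop over the Group levels, something like this (illustrative only, for a window of the current Group plus the two preceding ones):

# Illustration only: pooled mean over each Group level and the two preceding
# levels; NA while fewer than three levels are available
grp_levels <- unique(df$Group)
sapply(seq_along(grp_levels), function(i) {
  if (i < 3) NA_real_ else df[Group %in% grp_levels[(i - 2):i], mean(Variable)]
})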

The output I'm trying to achieve (for the case of n = 3) would look like

Group Variable
1:     A    NA    
2:     A    NA
3:     B    NA
4:     B    NA
5:     B    NA
6:     C    0.13
7:     C    0.13
8:     D    -1.36
9:     D    -1.36
10:    D    -1.36
11:    D    -1.36
12:    E    0
13:    F    -1.85
14:    F    -1.85
15:    F    -1.85
16:    F    -1.85

Any help would be appreciated.

J..S
  • If you are looking for `rollapply` with a variable window width, you should check this question: http://stackoverflow.com/questions/21368245/adaptive-rolling-window-function-top-performance-in-r (keep in mind the highest upvoted answer at the moment does not answer the question). – jangorecki Aug 09 '15 at 11:41
  • This link was helpful. – J..S Aug 10 '15 at 09:41

3 Answers

library(data.table)
library(RcppRoll)

# One row per Group: group size (n) and sum of Variable (S)
df1 <- df[, .(n = .N, S = sum(Variable)), by = Group]
# Right-aligned rolling window of 3 Groups: pooled mean = rolling sum of S / rolling sum of n
df1[, NewVariable := roll_sum(S, 3, align = "right", fill = NA) /
                     roll_sum(n, 3, align = "right", fill = NA)]
# Join the per-group result back onto the original rows
df[df1, on = "Group"]
    Group Variable n     S NewVariable
 1:     A    -0.63 2 -0.45          NA
 2:     A     0.18 2 -0.45          NA
 3:     B    -0.84 3  1.09          NA
 4:     B     1.60 3  1.09          NA
 5:     B     0.33 3  1.09          NA
 6:     C    -0.82 2 -0.33  0.04428571
 7:     C     0.49 2 -0.33  0.04428571
 8:     D     0.74 4  2.52  0.36444444
 9:     D     0.58 4  2.52  0.36444444
10:     D    -0.31 4  2.52  0.36444444
11:     D     1.51 4  2.52  0.36444444
12:     E     0.39 1  0.39  0.36857143
13:     F    -0.62 4 -1.75  0.12888889
14:     F    -2.21 4 -1.75  0.12888889
15:     F     1.12 4 -1.75  0.12888889
16:     F    -0.04 4 -1.75  0.12888889

I hope my solution is self-explanatory.
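One detail worth noting: NewVariable is sum(S)/sum(n) over the rolling window, i.e. the mean of all pooled observations in the last three Groups, not the mean of the three group means. As a quick check (using the objects created above), for group D the right-aligned window is B, C, D, so the two quantities below should agree:

# Pooled mean of all observations in groups B, C and D
df[Group %in% c("B", "C", "D"), mean(Variable)]
# Rolling value computed above for group D
df1[Group == "D", NewVariable]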

The dplyr equivalent is:

library(dplyr)

df %>% 
  group_by(Group) %>% 
  summarise(n = n(), S = sum(Variable)) %>% 
  mutate(NewVar = roll_sum(S, 3, align = "right", fill = NA) /
                  roll_sum(n, 3, align = "right", fill = NA)) %>% 
  left_join(df, by = "Group")

Data

set.seed(1)
df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), Variable = round(rnorm(16), 2))

Package info

[1] RcppRoll_0.2.2   data.table_1.9.5
Khashaa
  • In retrospect, creating a new df1 with the variable sums by group, then merging with the original df seems obvious. However, this solution was beyond me. The dplyr solution here (and in other useful answers) led me to believe it's probably time to devote some effort to this package. Thanks all for the helpful direction. – J..S Aug 10 '15 at 09:54

This may not be the most efficient way, but it works:

First, let's set the seed for reproducibility:

set.seed(1038)
> df
    Group Variable
 1:     A    -0.86
 2:     A     0.57
 3:     B     0.10
 4:     B    -1.57
 5:     B     1.73
 6:     C    -0.56
 7:     C     0.54
 8:     D    -1.71
 9:     D    -0.47
10:     D    -1.00
11:     D     1.03
12:     E    -0.47
13:     F    -1.06
14:     F    -2.06
15:     F    -0.57
16:     F     1.70

Now cast Group as an integer to make n-1 more tangible, then condense the multiple observations per group by grp_no:

# Integer group index; key on it so the groups are in order
setkey(df[ , grp_no := as.integer(as.factor(Group))], grp_no)
# One row per group: total of Variable (ttl) and group size (N)
df_ttls <- df[ , .(ttl = sum(Variable), .N), by = grp_no]
> df_ttls
   grp_no   ttl N
1:      1 -0.29 2
2:      2  0.26 3
3:      3 -0.02 2
4:      4 -2.15 4
5:      5 -0.47 1
6:      6 -1.99 4

Now create the weighted average you seek using shift:

df_ttls[ , lag3avg := rowSums(sapply(0:2, shift, x = ttl))/
              rowSums(sapply(0:2, shift, x = N))]
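To see what this does: sapply(0:2, shift, x = ttl) builds a matrix whose columns are ttl lagged by 0, 1 and 2 groups, so its row sums are 3-group rolling totals, with NA wherever fewer than 3 groups are available. Written out with explicit shift calls, the same assignment should be equivalent to:

# Equivalent explicit form (for the 3-group case only)
df_ttls[ , lag3avg := (ttl + shift(ttl, 1) + shift(ttl, 2)) /
                      (N   + shift(N, 1)   + shift(N, 2))]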

And merge back to the full data set:

df[df_ttls, lag3avg := i.lag3avg][ ]
    Group Variable grp_no      lag3avg
 1:     A    -0.86      1           NA
 2:     A     0.57      1           NA
 3:     B     0.10      2           NA
 4:     B    -1.57      2           NA
 5:     B     1.73      2           NA
 6:     C    -0.56      3 -0.007142857
 7:     C     0.54      3 -0.007142857
 8:     D    -1.71      4 -0.212222222
 9:     D    -0.47      4 -0.212222222
10:     D    -1.00      4 -0.212222222
11:     D     1.03      4 -0.212222222
12:     E    -0.47      5 -0.377142857
13:     F    -1.06      6 -0.512222222
14:     F    -2.06      6 -0.512222222
15:     F    -0.57      6 -0.512222222
16:     F     1.70      6 -0.512222222

Note that this can easily be extended to a function:

k_lag_avg <- function(k){
  # 1:k - 1L is 0:(k - 1), i.e. lags of 0 up to k - 1 groups
  df[df_ttls[ , .(grp_no, rowSums(sapply(1:k - 1L, shift, x = ttl))/
                  rowSums(sapply(1:k - 1L, shift, x = N)))],
     paste0("lag", k, "avg") := i.V2]
}

k_lag_avg(5L); df[ ]
    Group Variable grp_no      lag3avg    lag5avg
 1:     A    -0.86      1           NA         NA
 2:     A     0.57      1           NA         NA
 3:     B     0.10      2           NA         NA
 4:     B    -1.57      2           NA         NA
 5:     B     1.73      2           NA         NA
 6:     C    -0.56      3 -0.007142857         NA
 7:     C     0.54      3 -0.007142857         NA
 8:     D    -1.71      4 -0.212222222         NA
 9:     D    -0.47      4 -0.212222222         NA
10:     D    -1.00      4 -0.212222222         NA
11:     D     1.03      4 -0.212222222         NA
12:     E    -0.47      5 -0.377142857 -0.2225000
13:     F    -1.06      6 -0.512222222 -0.3121429
14:     F    -2.06      6 -0.512222222 -0.3121429
15:     F    -0.57      6 -0.512222222 -0.3121429
16:     F     1.70      6 -0.512222222 -0.3121429
MichaelChirico

This will work if you are willing to transform your data.table into a data.frame. Have a look at this example and run the commands step by step to see how it works; it covers the case n = 3 you mentioned.

library(dplyr)

df <- data.frame(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), 
                 Variable = round(rnorm(16), 2))


df %>% group_by(Group) %>%
  do(data.frame(df2 = df)) %>%
  mutate(diff = as.numeric(Group) - as.numeric(df2.Group)) %>%
  filter(diff %in% 0:2) %>%
  mutate(unique_pairs = n_distinct(diff)) %>%
  filter(unique_pairs ==3) %>%
  mutate(Mean = mean(df2.Variable)) %>%
  filter(diff==0) %>%
  select(Group, Mean) %>%
  ungroup

The idea is simply to create all combinations of the Group names and then add some helper columns to filter on. You could also do this with a for loop (a sketch follows below), but I expect it to be slower.
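For reference, a rough for-loop version of the same calculation (a sketch only, not benchmarked) could look like this:

# For-loop sketch (n = 3): pool the observations of the current group and the
# two preceding groups; NA while fewer than three groups are available
grps <- unique(as.character(df$Group))
out <- data.frame(Group = grps, Mean = NA_real_)
for (i in seq_along(grps)) {
  if (i >= 3) {
    window <- grps[(i - 2):i]
    out$Mean[i] <- mean(df$Variable[df$Group %in% window])
  }
}
out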

In case you really want to work with a data.table (still dplyr, but with a data.table structure in the background), try this:

library(dplyr)
library(data.table)

df <- data.table(Group = c(rep("A", 2), rep("B", 3), rep("C", 2), rep("D", 4), "E", rep("F", 4)), 
                     Variable = round(rnorm(16), 2))

df = df %>% mutate(Group2 = as.numeric(as.factor(Group)))

df %>% 
  group_by(Group2, Group) %>%
  do(data.table(df2 = df)) %>%
  mutate(diff = Group2 - df2.Group2) %>%
  filter(diff %in% 0:2) %>%
  group_by(Group2, Group) %>%
  mutate(unique_pairs = n_distinct(diff)) %>%
  filter(unique_pairs ==3) %>%
  group_by(Group2, Group) %>%
  mutate(Mean = mean(df2.Variable)) %>%
  filter(diff==0) %>%
  select(Group2, Group, Mean) %>%
  ungroup

Here data.table doesn't like factors, so I had to work with numbers instead of letters for the Group variable. Also, after every mutate I had to group again (this is a known dplyr issue when working with a data.table in the background). The approach is exactly the same though.

AntoniosK
  • *Also, after every mutate I had to group again (this is a known data.table issue)* - could you link the issue in data.table repo related to this? – jangorecki Aug 09 '15 at 11:36
  • Yes, I've seen those: https://github.com/hadley/dplyr/issues/919 and http://stackoverflow.com/questions/31363269/different-behavior-for-group-by-for-data-table-vs-data-frame – AntoniosK Aug 09 '15 at 11:40
  • I wouldn't then call it a *known data.table issue* but rather a *known dplyr issue*. – jangorecki Aug 09 '15 at 11:42
  • You're right. It's a dplyr issue when having data.table datasets in the background. I'll correct it..... – AntoniosK Aug 09 '15 at 11:47