R: Create multiple columns in list of data.tables and modify them with condition on a grouping variable

Question

I have a list of data tables that looks like like this:

group1 <- data.table(
    group = rep(x = c("group1"), each = 16),
    amount = rep(x = 7:4, each = 4),
    subgr = rep(x = 1:2, each = 8),    
    ind = rep(x = 0:1, each = 4, times = 2)
  )

group2 <- data.table(
    group = rep(x = c("group2"), each = 36),
    amount = rep(x = 13:8, each = 6),
    subgr = rep(x = 1:3, each = 12),
    ind = rep(x = 0:1, each = 6, times = 3)
  )

mydt <- rbind(group1, group2)

mydt <- lapply(X = split(x = 1:nrow(mydt), f = mydt[["group"]]),
FUN = function(i)mydt[i])

The object presented above is oversimplified, the actual list contains much more and much bigger data.tables, each with slightly different structure in terms of number of rows distributed across subgr and number of subgr themselves. What I want to achieve is:

Create multiple columns in each data.table in the list that is equal to the number of unique values in subgr. Each new column is a copy of the amount. The number of copied columns will be equal to the number of unique values in subgr.
Modify the newly created columns within each subgr (say amount*2 if ind == 1 and amount*4 if ind ==0), leaving the rest of the values in the subgroups in subgr unaffected.

That is, to have something like this (only mydt$group1 shown here, but it applies for all tables):

$group1
     group amount subgr ind am1 am2
 1: group1      7     1   0  28   7
 2: group1      7     1   0  28   7
 3: group1      7     1   0  28   7
 4: group1      7     1   0  28   7
 5: group1      6     1   1  12   6
 6: group1      6     1   1  12   6
 7: group1      6     1   1  12   6
 8: group1      6     1   1  12   6
 9: group1      5     2   0   5  20
10: group1      5     2   0   5  20
11: group1      5     2   0   5  20
12: group1      5     2   0   5  20
13: group1      4     2   1   4   8
14: group1      4     2   1   4   8
15: group1      4     2   1   4   8
16: group1      4     2   1   4   8

I know that splitting a data.table into list of data.tables is not a good idea, as mentioned in this postbut this is how the object is. Besides that, the split is related with the task I need to perform:

The data tables contain different number of rows.
The rows are grouped into subgroups defined by subgr and their number also differs across the different data tables, i.e. the number of new columns will differ across the entire list.

That is, the entire data.table can't be processed at once because different number of columns will be created for each group in the group variable.

What I tried so far is writing a function using the second solution in the accepted answer from this post:

myfun <- function(data, quantity, region, index) {
  data <- lapply(data, function(i) {
    i[ , eval(paste0("am", unique(i[[region]]))) := i[[quantity]]]
  })
  data <- lapply(X = data, FUN = function(i) {
    rep.names <- paste0("am", unique(i[[region]]))
    i[ , eval(rep.names) := lapply(.SD, function(j) {
      ifelse(i[["ind"]] == 1L, j*2L, j*4L)
      }), by = region, .SDcols = rep.names]
  })
  return(data)
}

myfun(mydt, quantity = "amount", region = "subgr", index = "ind")

It does not work as intended, it modifies the whole range of values within all variables according to the condition. It throws warnings, though, which point the problem. Here is just the first warning, the other are the same:

Warning messages:
1: In `[.data.table`(i, , `:=`(eval(rep.names), lapply(.SD,  ... :
  RHS 1 is length 16 (greater than the size (8) of group 1). The last
8 element(s) will be discarded.

That is, it uses just the rows it has to on the LHS, but then takes the entire column for the RHS. Apparently I am missing something important here. The difference with the second solution from the accepted answer from [this post][3] is that there are multiple columns to use, while in my case is just one (amount).

Can someone help?

Since data.table 1.9.7 you can split with `split(mydt, by="group")`. If you have more than ~1000 unique values `subgr` you should also use `alloc.col` before making `:=`. — jangorecki, Jun 28 '16 at 17:56
@jangorecki: Thank you, this is useful information. Version 1.9.7 is not out yet, would like to learn more about this functionality. Can you provide a bit more details on the `alloc.col`? — panman, Jun 29 '16 at 14:18
maybe [`?alloc.col`](http://jangorecki.gitlab.io/data.table/library/data.table/html/truelength.html) — jangorecki, Jun 29 '16 at 16:12

score 3 · Accepted Answer · answered Jul 03 '16 at 02:10

I would suggest that this is a task for which a for-loop is appropriate. You can iterate over the list and modify each data.table in-place, without having to rebuild the list, which is what lapply() does.

Additionally, I would recommend you construct the am* columns in a matrix before having them assigned to the target data.table. By passing amount as the underlying data vector, we can get most of the way towards completing all of the am* columns in one shot, since most cells take their value directly from the amount column without any change, especially if there are many unique subgr values. Afterward we can selectively modify the cells that must be changed by index-assigning the data matrix with an index matrix. Building the index matrix will be fairly easy, since we know only one cell per row must be changed. Basically we can cbind() the row index sequence .I with the required column indexes as computed from match(subgr,grs) where grs is the unique set of subgr values. This will be more efficient than doing an equality comparison like j==i[[region]] for every am* column.

for (i in seq_along(mydt)) {
    grs <- unique(mydt[[i]]$subgr);
    mydt[[i]][,paste0('am',grs):={
        m <- matrix(amount,.N,length(grs));
        m[cbind(.I,match(subgr,grs))] <- amount*ifelse(ind==1L,2L,4L);
        as.data.frame(m);
    }];
}; ## end for
mydt;
## $group1
##      group amount subgr ind am1 am2
##  1: group1      7     1   0  28   7
##  2: group1      7     1   0  28   7
##  3: group1      7     1   0  28   7
##  4: group1      7     1   0  28   7
##  5: group1      6     1   1  12   6
##  6: group1      6     1   1  12   6
##  7: group1      6     1   1  12   6
##  8: group1      6     1   1  12   6
##  9: group1      5     2   0   5  20
## 10: group1      5     2   0   5  20
## 11: group1      5     2   0   5  20
## 12: group1      5     2   0   5  20
## 13: group1      4     2   1   4   8
## 14: group1      4     2   1   4   8
## 15: group1      4     2   1   4   8
## 16: group1      4     2   1   4   8
##
## $group2
##      group amount subgr ind am1 am2 am3
##  1: group2     13     1   0  52  13  13
##  2: group2     13     1   0  52  13  13
##  3: group2     13     1   0  52  13  13
##  4: group2     13     1   0  52  13  13
##  5: group2     13     1   0  52  13  13
##  6: group2     13     1   0  52  13  13
##  7: group2     12     1   1  24  12  12
##  8: group2     12     1   1  24  12  12
##  9: group2     12     1   1  24  12  12
## 10: group2     12     1   1  24  12  12
## 11: group2     12     1   1  24  12  12
## 12: group2     12     1   1  24  12  12
## 13: group2     11     2   0  11  44  11
## 14: group2     11     2   0  11  44  11
## 15: group2     11     2   0  11  44  11
## 16: group2     11     2   0  11  44  11
## 17: group2     11     2   0  11  44  11
## 18: group2     11     2   0  11  44  11
## 19: group2     10     2   1  10  20  10
## 20: group2     10     2   1  10  20  10
## 21: group2     10     2   1  10  20  10
## 22: group2     10     2   1  10  20  10
## 23: group2     10     2   1  10  20  10
## 24: group2     10     2   1  10  20  10
## 25: group2      9     3   0   9   9  36
## 26: group2      9     3   0   9   9  36
## 27: group2      9     3   0   9   9  36
## 28: group2      9     3   0   9   9  36
## 29: group2      9     3   0   9   9  36
## 30: group2      9     3   0   9   9  36
## 31: group2      8     3   1   8   8  16
## 32: group2      8     3   1   8   8  16
## 33: group2      8     3   1   8   8  16
## 34: group2      8     3   1   8   8  16
## 35: group2      8     3   1   8   8  16
## 36: group2      8     3   1   8   8  16
##      group amount subgr ind am1 am2 am3
##

Benchmarking

library(microbenchmark);
library(data.table);

hubert <- function(mydt) { myfun <- function(data, quantity, region, index) lapply(data, function(i) i[ , eval(paste0("am", unique(i[[region]]))) := lapply(unique(i[[region]]), function(j) {i[[quantity]]*ifelse(j==i[[region]],ifelse(ind==1, 2, 4), 1)})] ); myfun(mydt, quantity = "amount", region = "subgr", index = "ind"); };
bgoldst <- function(mydt) { for (i in seq_along(mydt)) { grs <- unique(mydt[[i]]$subgr); mydt[[i]][,paste0('am',grs):={ m <- matrix(amount,.N,length(grs)); m[cbind(.I,match(subgr,grs))] <- amount*ifelse(ind==1L,2L,4L); as.data.frame(m); }]; }; mydt; };

## OP's example
group1 <- data.table(group=rep(x=c("group1"),each=16),amount=rep(x=7:4,each=4),subgr=rep(x=1:2,each=8),ind=rep(x=0:1,each=4,times=2));
group2 <- data.table(group=rep(x=c("group2"),each=36),amount=rep(x=13:8,each=6),subgr=rep(x=1:3,each=12),ind=rep(x=0:1,each=6,times=3));
mydt <- rbind(group1,group2);
mydt <- lapply(X=split(x=1:nrow(mydt),f=mydt[["group"]]),FUN=function(i)mydt[i]);

ex <- hubert(lapply(mydt,copy));
all.equal(ex,bgoldst(lapply(mydt,copy)));
## [1] TRUE

microbenchmark(hubert(lapply(mydt,copy)),bgoldst(lapply(mydt,copy)));
## Unit: milliseconds
##                         expr      min       lq     mean   median       uq      max neval
##   hubert(lapply(mydt, copy)) 2.579173 2.632417 2.837445 2.669621 2.736549 6.555914   100
##  bgoldst(lapply(mydt, copy)) 2.603977 2.683092 2.880715 2.723078 2.781025 4.376168   100

## scale test
set.seed(1L);
NR <- 1e5L; NGRP <- 1e3L; NAMT <- 30L; NSUBGR <- 30L;
mydt <- data.table(group=paste0('group',sample(NGRP,NR,T)),amount=sample(NAMT,NR,T),subgr=sample(NSUBGR,NR,T),ind=sample(0:1,NR,T));
mydt <- split(mydt,mydt$group);

ex <- hubert(lapply(mydt,copy));
all.equal(ex,bgoldst(lapply(mydt,copy)));
## [1] TRUE

microbenchmark(hubert(lapply(mydt,copy)),bgoldst(lapply(mydt,copy)));
## Unit: seconds
##                         expr      min       lq     mean   median       uq      max neval
##   hubert(lapply(mydt, copy)) 2.831080 2.899419 2.938751 2.935096 2.970701 3.110481   100
##  bgoldst(lapply(mydt, copy)) 1.571023 1.647102 1.674666 1.671877 1.709434 1.845174   100

Thank you bgoldst, your answer is very well explained. – panman Jul 03 '16 at 10:23 — panman, Jul 03 '16 at 10:23

HubertL · Answer 2 · 2016-06-28T18:41:07.783

Your error come from the length of i[["ind"]] that contains all the rows from the dataset while j only contains rows from the group:

ifelse(i[["ind"]] == 1L, j*2L, j*4L)

There are several possibilities to fix this and achieve your goal, and this is how I would do it:

myfun <- function(data, quantity, region, index) {
        lapply(data, function(i) {
                i[ , eval(paste0("am", unique(i[[region]]))) := lapply(unique(i[[region]]), function(j)
                        {i[[quantity]]*ifelse(j==i[[region]],ifelse(ind==1, 2, 4), 1)})]
        })
}
myfun(mydt, quantity = "amount", region = "subgr", index = "ind")
$group1
     group amount subgr ind am1 am2
 1: group1      7     1   0  28   7
 2: group1      7     1   0  28   7
 3: group1      7     1   0  28   7
 4: group1      7     1   0  28   7
 5: group1      6     1   1  12   6
 6: group1      6     1   1  12   6
 7: group1      6     1   1  12   6
 8: group1      6     1   1  12   6
 9: group1      5     2   0   5  20
10: group1      5     2   0   5  20
11: group1      5     2   0   5  20
12: group1      5     2   0   5  20
13: group1      4     2   1   4   8
14: group1      4     2   1   4   8
15: group1      4     2   1   4   8
16: group1      4     2   1   4   8

$group2
     group amount subgr ind am1 am2 am3
 1: group2     13     1   0  52  13  13
 2: group2     13     1   0  52  13  13
 3: group2     13     1   0  52  13  13
 4: group2     13     1   0  52  13  13
 5: group2     13     1   0  52  13  13
 6: group2     13     1   0  52  13  13
 7: group2     12     1   1  24  12  12
 8: group2     12     1   1  24  12  12
 9: group2     12     1   1  24  12  12
10: group2     12     1   1  24  12  12
11: group2     12     1   1  24  12  12
12: group2     12     1   1  24  12  12
13: group2     11     2   0  11  44  11
14: group2     11     2   0  11  44  11
15: group2     11     2   0  11  44  11
16: group2     11     2   0  11  44  11
17: group2     11     2   0  11  44  11
18: group2     11     2   0  11  44  11
19: group2     10     2   1  10  20  10
20: group2     10     2   1  10  20  10
21: group2     10     2   1  10  20  10
22: group2     10     2   1  10  20  10
23: group2     10     2   1  10  20  10
24: group2     10     2   1  10  20  10
25: group2      9     3   0   9   9  36
26: group2      9     3   0   9   9  36
27: group2      9     3   0   9   9  36
28: group2      9     3   0   9   9  36
29: group2      9     3   0   9   9  36
30: group2      9     3   0   9   9  36
31: group2      8     3   1   8   8  16
32: group2      8     3   1   8   8  16
33: group2      8     3   1   8   8  16
34: group2      8     3   1   8   8  16
35: group2      8     3   1   8   8  16
36: group2      8     3   1   8   8  16

Actually the fourth line in the function you provided should say `{i[[quantity]]*ifelse(j==i[[region]],ifelse(i[[index]]==1, 2, 4), 1)})]`. As it is now it uses the actual variable name and does not match it as a function argument. — panman, Jun 29 '16 at 22:06

R: Create multiple columns in list of data.tables and modify them with condition on a grouping variable

2 Answers2

Benchmarking