0

I am working on some simulations in R. However, my code runs very slow, making it unsuitable for high N and high number of sims s. I've identified the bottleneck as the following function (see below). It takes a list of s tibbles as input. Basically, within group g (two rows per g_id) I want to assign the values of my parameters (probability) to the first row within the group (g_first == 1), given that the values for the variables A and B match specific patterns within g (i.e. handcoding an interaction between two categorical variables). The values of A and B are also stored as 1/0 variables (e.g. B_x takes a value of 1 if B is x).

I tried to use a list of data.tables instead of tibbles as input but it actually slowed it down further (maybe bc the subsetting below is slow for data.tables. Further, I couldnt figure out how to setkey() for a list of data.tables).

Per request, I uploaded a sample df (list with 100 simulated tibbles) under this link: https://ufile.io/dsax7us8 . Call the function via lapply(df, assign_p) . In the code below, you also find a quick glance on one of the elements of the df list.

#a quick glance on the data structure
dput(head(df[[1]]))

structure(list(p_id = c(1L, 1L, 1L, 1L, 2L, 2L), g_id = c(1L, 
1L, 2L, 2L, 3L, 3L), pr_id = 1:6, g_first = c(1, 0, 1, 0, 1, 
0), A = c(1, 0, 1, 0, 0, 1), B = c("x", "z", "x", "x", "z", "z"
), C = c(1, 1, 1, 1, 1, 0), A_0 = c(0, 1, 0, 1, 1, 0), A_1 = c(1, 
0, 1, 0, 0, 1), B_x = c(1, 0, 1, 1, 0, 0), B_z = c(0, 1, 0, 0, 
1, 1), C_0 = c(0, 0, 0, 0, 0, 1), C_1 = c(1, 1, 1, 1, 1, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

dput(head(df[[2]]))

structure(list(p_id = c(1L, 1L, 1L, 1L, 2L, 2L), g_id = c(1L, 
1L, 2L, 2L, 3L, 3L), pr_id = 1:6, g_first = c(1, 0, 1, 0, 1, 
0), A = c(0, 0, 1, 0, 1, 0), B = c("x", "z", "z", "x", "x", "z"
), C = c(1, 1, 1, 0, 1, 0), A_0 = c(1, 1, 0, 1, 0, 1), A_1 = c(0, 
0, 1, 0, 1, 0), B_x = c(1, 0, 0, 1, 1, 0), B_z = c(0, 1, 1, 0, 
0, 1), C_0 = c(0, 0, 0, 1, 0, 1), C_1 = c(1, 1, 1, 0, 1, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

#params
A_10_B_xx <- 0.04 
A_10_B_xz <- 0.04
A_10_B_zz <- 0.003
A_10_B_zx <- 0.003

A_01_B_xx <- -0.04
A_01_B_xz <- -0.04
A_01_B_zz <- -0.003
A_01_B_zx <- -0.003


assign_p <- function(df) {


 df$p <- NA_real_
 
 g <- max(df$g_id)
 
 
 for (i in 1:g) {

#generate logical checks
   A_11 = df[df$g_id == i & df$g_first == 1, "A_1"]
   A_10 = df[df$g_id == i & df$g_first == 0, "A_1"]
   A_01 = df[df$g_id == i & df$g_first == 1, "A_0"]
   A_00 = df[df$g_id == i & df$g_first == 0, "A_0"]
   B_x1 = df[df$g_id == i & df$g_first == 1, "B_x"]
   B_x0 = df[df$g_id == i & df$g_first == 0, "B_x"]
   B_z1 = df[df$g_id == i & df$g_first == 1, "B_z"]
   B_z0 = df[df$g_id == i & df$g_first == 0, "B_z"]
  
#multiply logical checks
   A_10_B_xx_out  <- A_10_B_xx * A_11 * A_00 * B_x1 * B_x0
   
   A_10_B_xz_out  <- A_10_B_xz * A_11 * A_00 * B_x1 * B_z0
   
   A_10_B_zz_out  <- A_10_B_zz * A_11 * A_00 * B_z1 * B_z0
   
   A_10_B_zx_out  <- A_10_B_zx * A_11 * A_00 * B_z1 * B_x0
   
   A_01_B_xx_out  <- A_01_B_xx * A_01 * A_10 * B_x1 * B_x0
   
   A_01_B_xz_out  <- A_01_B_xz * A_01 * A_10 * B_x1 * B_z0
   
   A_01_B_zz_out  <- A_01_B_zz * A_01 * A_10 * B_z1 * B_z0
   
   A_01_B_zx_out  <- A_01_B_zx * A_01 * A_10 * B_z1 * B_x0
   
   
#add matches to 0.5 and assign to column p    
   df[df$g_id == i &
        df$g_first == 1, "p"] <- 0.50 +
     A_10_B_xx_out +
     A_10_B_xz_out +
     A_10_B_zz_out +
     A_10_B_zx_out +
     A_01_B_xx_out +
     A_01_B_xz_out +
     A_01_B_zz_out +
     A_01_B_zx_out
 }
 

 df 
}


persephone
  • 380
  • 2
  • 10
  • 2
    Welcome to stackoverflow! Please provide us with some dummy data (e.g. using `dput`). It might be beneficial to `rbindlist` the data.tables before working on them. Furthermore, you might want to check `fcase()` available since [data.table v1.13.0 (24 Jul 2020)](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#datatable-v1130--24-jul-2020) – ismirsehregal Sep 03 '20 at 11:12
  • Thanks for the update - but can you please just paste the output of `dput(my_tibble_1)` or `dput(head(my_tibble_1))` so we can see the structure without downloading files from external sources? Thank you! [Here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) you can find some related info. – ismirsehregal Sep 03 '20 at 12:20
  • Can you please elaborate on the pattern A (1,0) B (x,z) have to match within each g_id? I think a big part of your code can be reduced to a simple aggregation by group (Please see chapter 2 [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)). – ismirsehregal Sep 03 '20 at 13:19
  • I want to code interaction effects: probability of chosing g (group_id==1 or group_id==0) given that A and B take on a particular combination of values. Y would be generated in the next step using the p generated here and is omitted here as it seems unrelated to the problem (i.e. the bottleneck is elsewhere). Thanks for these hints, i'll check them out and try around a bit. – persephone Sep 03 '20 at 13:29
  • `given that A and B take on a particular combination of values` - which combinations? – ismirsehregal Sep 03 '20 at 13:30
  • e.g. A_10_B_xx, meaning: A = 1 if g_first ==1; A = 0 if g_first == 0; B==x if g_first==1; B==x if g_first==0 – persephone Sep 03 '20 at 13:34
  • 2
    I'd create a look up table with your params and join it with the current dataset and after that do the calculations. – ismirsehregal Sep 03 '20 at 13:46
  • That might actually be a good idea! Perhaps I was thinking overly complicated - ill check it out. Thanks! – persephone Sep 03 '20 at 13:48
  • I would start by consolidating a lot of your variables into vectors or lists just for readability. Replace all your `A*_B_*` with a single object, for example, and index into it. Next, your logical checks should use short-circuit `&&` rather than `&` . – Carl Witthoft Sep 03 '20 at 15:33
  • Please check my answer. - If you also provide me with `dput(head(df[[2]]))` I can give an example for your list. – ismirsehregal Sep 04 '20 at 07:17
  • Thanks, I figured it out - nonetheless I provided a secon dput. Overall, your solution provides huge speed up! – persephone Sep 04 '20 at 12:12
  • Glad it helped! I just updated my answer with an approach for the list input (might be helpful in comparison to your code). – ismirsehregal Sep 04 '20 at 12:32

1 Answers1

1

Here is an example based on the single data.frame you provided. As mentioned in the comments I've created a look-up table and tried to create a matching column (param_group) based on the logic you provided (not sure if it is correct) to join both tables.

Please check the following:

library(data.table)

DT <- structure(list(p_id = c(1L, 1L, 1L, 1L, 2L, 2L),
                          g_id = c(1L, 1L, 2L, 2L, 3L, 3L), 
                          pr_id = 1:6,
                          g_first = c(1, 0, 1, 0, 1, 0),
                          A = c(1, 0, 1, 0, 0, 1),
                          B = c("x", "z", "x", "x", "z", "z"),
                          C = c(1, 1, 1, 1, 1, 0), A_0 = c(0, 1, 0, 1, 1, 0),
                          A_1 = c(1, 0, 1, 0, 0, 1),
                          B_x = c(1, 0, 1, 1, 0, 0),
                          B_z = c(0, 1, 0, 0, 1, 1), 
                          C_0 = c(0, 0, 0, 0, 0, 1), 
                          C_1 = c(1, 1, 1, 1, 1, 0)), 
                     row.names = c(NA, -6L), 
                     class = c("tbl_df", "tbl", "data.frame"))

setDT(DT)

parameterDT <- data.table(
  param_group = c("A_10_B_xx", "A_10_B_xz", "A_10_B_zz", "A_10_B_zx", "A_01_B_xx", "A_01_B_xz", "A_01_B_zz", "A_01_B_zx"),
  parameter = c(0.04, 0.04, 0.003, 0.003, -0.04, -0.04, -0.003, -0.003)
)

DT[, param_group := paste0("A_", paste0(.SD$A, collapse =  ""), "_B_", paste0(.SD$B, collapse =  "")), by = g_id, .SDcols = c("A", "B")]
DT <- parameterDT[DT, on = "param_group"]

DT[g_first == 1, p := 0.5 + parameter]
DT

   param_group parameter p_id g_id pr_id g_first A B C A_0 A_1 B_x B_z C_0 C_1     p
1:   A_10_B_xz     0.040    1    1     1       1 1 x 1   0   1   1   0   0   1 0.540
2:   A_10_B_xz     0.040    1    1     2       0 0 z 1   1   0   0   1   0   1    NA
3:   A_10_B_xx     0.040    1    2     3       1 1 x 1   0   1   1   0   0   1 0.540
4:   A_10_B_xx     0.040    1    2     4       0 0 x 1   1   0   1   0   0   1    NA
5:   A_01_B_zz    -0.003    2    3     5       1 0 z 1   1   0   0   1   0   1 0.497
6:   A_01_B_zz    -0.003    2    3     6       0 1 z 0   0   1   0   1   1   0    NA

As also mentioned before, regarding your list of tables I'd recommend using rbindlist to create a single data.table.

Edit: Here is how I'd approach this with a list of data.frames as input:

library(data.table)

df <- list(structure(
  list(
    p_id = c(1L, 1L, 1L, 1L, 2L, 2L),
    g_id = c(1L, 1L, 2L, 2L, 3L, 3L),
    pr_id = 1:6,
    g_first = c(1, 0, 1, 0, 1, 0),
    A = c(1, 0, 1, 0, 0, 1),
    B = c("x", "z", "x", "x", "z", "z"),
    C = c(1, 1, 1, 1, 1, 0),
    A_0 = c(0, 1, 0, 1, 1, 0),
    A_1 = c(1, 0, 1, 0, 0, 1),
    B_x = c(1, 0, 1, 1, 0, 0),
    B_z = c(0, 1, 0, 0, 1, 1),
    C_0 = c(0, 0, 0, 0, 0, 1),
    C_1 = c(1, 1, 1, 1, 1, 0)
  ),
  row.names = c(NA,-6L),
  class = c("tbl_df", "tbl", "data.frame")
),
structure(
  list(
    p_id = c(1L, 1L, 1L, 1L, 2L, 2L),
    g_id = c(1L, 1L, 2L, 2L, 3L, 3L),
    pr_id = 1:6,
    g_first = c(1, 0, 1, 0, 1, 0),
    A = c(0, 0, 1, 0, 1, 0),
    B = c("x", "z", "z", "x", "x", "z"),
    C = c(1, 1, 1, 0, 1, 0),
    A_0 = c(1, 1, 0, 1, 0, 1),
    A_1 = c(0, 0, 1, 0, 1, 0),
    B_x = c(1, 0, 0, 1, 1, 0),
    B_z = c(0, 1, 1, 0, 0, 1),
    C_0 = c(0, 0, 0, 1, 0, 1),
    C_1 = c(1, 1, 1, 0, 1, 0)
  ),
  row.names = c(NA,-6L),
  class = c("tbl_df", "tbl", "data.frame")
))

parameterDT <- data.table(
  param_group = c("A_10_B_xx", "A_10_B_xz", "A_10_B_zz", "A_10_B_zx", "A_01_B_xx", "A_01_B_xz", "A_01_B_zz", "A_01_B_zx"),
  parameter = c(0.04, 0.04, 0.003, 0.003, -0.04, -0.04, -0.003, -0.003)
)

DT <- rbindlist(df, idcol = TRUE)

DT[, param_group := paste0("A_", paste0(.SD$A, collapse =  ""), "_B_", paste0(.SD$B, collapse =  "")), by = .(.id, g_id), .SDcols = c("A", "B")]
DT <- parameterDT[DT, on = "param_group"]

DT[g_first == 1, p := 0.5 + parameter]
print(DT)

# back to a list of data.tables
# split(DT, by = ".id")
ismirsehregal
  • 30,045
  • 5
  • 31
  • 78
  • One small question: if I merge everything to a single data frame, I naturally face a classic time/memory trade-off as my number of simulations increase (e.g. 10k dts and 200 million obs in total). So looping for every dt, calculating what i need to calculate, and saving only the few model outputs (which is the next step of this code snippet) should be the go to i guess. Is there a way to do this efficiently in the data.table context? – persephone Sep 06 '20 at 07:33
  • @BobD. Sorry but the question isn't clear to me. Of course, merging the data.frames has it's limits. I don't really know your data and the setup you are working on. Nevertheless, reducing calculations to the needed minimum sounds like a good idea. – ismirsehregal Sep 08 '20 at 07:17