For loops to run models, subset a new data set and assign column in new data in R

Question

I have this hypothetical data:

structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 
3, 3), A = c(1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0), B = c(0, 
0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1), C = c(1, 1, 1, 0, 
1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1), M = c(1, 2, 3, 4, 1, 2, 3, 
4, 5, 1, 2, 3, 4, 5, 6)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))

I am trying to run a loop that will create a subset based on each value in column M and then create a column in that subset. But I am having trouble with assign()

What I want is to generate a subset M_i for each value of M and then create a column in that subset data M_i$ps_i for a glm prediction. Another issue is not all ID have the same number of M and I'm not sure whether that needs to be considered (and only 1 ID with M=6). I've tried below but can't figure out assign syntax:

n<-max(dat$M)
for (ii in 1:n){
  glm<-glm(A~B+C, data=dat,subset=(M==ii)), family=binomial)
  M_ii<-subset(dat,M==ii)
  assign(paste0("M_",ii, value = M_ii)
  M_ii$ps_ii<-predict(glm,type='response') 
  assign(paste0("ps_ii",ii,value = predict(glm,type='response') )
}

`data=dat,subset=(M==ii))` : this part has an incorrect syntax — Julien, Mar 16 '23 at 13:13
Use lapply instead of forloop, then there will be no need for `assign`. All the output would be in one list. — zx8754, Mar 16 '23 at 13:18

score 2 · Answer 1 · answered Mar 16 '23 at 13:28

Use lapply instead of forloop, then there will be no need for assign. All the output would be in one list.

res <- lapply(split(dat, dat$M), function(i){
  cbind(i, 
        ps = predict(
          glm(A ~ B + C, data = i, family = binomial),
          type = "response"))
  })

res
# $`1`
#   ID A B C M        ps
# 1  1 1 0 1 1 0.6666667
# 2  2 0 0 1 1 0.6666667
# 3  3 1 0 1 1 0.6666667
# 
# $`2`
#   ID A B C M           ps
# 1  1 0 0 1 2 5.826215e-11
# 2  2 0 1 1 2 5.826215e-11
# 3  3 1 0 0 2 1.000000e+00
# 
#...etc

zephryl · Accepted Answer · 2023-03-16T13:31:24.420

Just move your first assign() call down one line, after you add the prediction column to M_ii. I would also use unique(dat$M) instead of 1:max(dat$M), as the latter will fail if there are numbers in the sequence with no observations.

for (ii in unique(dat$M)){
  glm<-glm(A~B+C, data=dat,subset=(M==ii), family=binomial)
  M_ii<-subset(dat,M==ii)
  M_ii$ps_ii<-predict(glm,type='response') 
  assign(paste0("M_",ii), value = M_ii)
}

Results:

#> M_1
   ID A B C M     ps_ii
1   1 1 0 1 1 0.6666667
5   2 0 0 1 1 0.6666667
10  3 1 0 1 1 0.6666667
#> M_6
   ID A B C M        ps_ii
15  3 0 1 1 6 1.583729e-10

As you can see, it does give you a prediction when there’s only one case. But this won’t ever be very informative due to perfect fit, and I would be cautious about using it depending on your application.

And as long as I’m dispensing unsolicited advice, it may be better to create a list of dataframes as demonstrated by @zx8754 rather than assigning into the global environment.

For loops to run models, subset a new data set and assign column in new data in R

2 Answers2