1

I want to do logistic regressions for several (n = 30) SNPs (coded as 0,1,2) as predictors and a casecontrol variable (0,1) as an outcome. As few of those rs are correlated, I cannot put all rs# in one model but have to run one at a time regression for each i.e., I cannot simply plus them together in one model like rs1 + rs2 + rs3 and so on....I need to have each regressed separately like below;

test1 = glm(casecontrol ~ rs1, data = mydata, family=binomial)

test2 = glm(casecontrol ~ rs2, data = mydata, family=binomial)

test3 = glm(casecontrol ~ rs3, data = mydata, family=binomial)

While I can run all the above regressions separately, is there a way to loop them together so I could get a summary() of all tests in one go?

I will have to adjust for age and sex too, but that would come after I run an unadjusted loop.

My data head from dput(head(mydata)) for example;

structure(list(ID = 1:6, sex = c(2L, 1L, 2L, 1L, 1L, 1L), age = c(52.4725405022036, 
58.4303618001286, 44.5300065923948, 61.4786037395243, 67.851808819687, 
39.7451378498226), bmi = c(31.4068751083687, 32.0614937413484, 
23.205021363683, 29.1445372393355, 32.6287483051419, 20.5887741968036
), casecontrol = c(0L, 1L, 0L, 1L, 1L, 1L), rs1 = c(1L, 0L, 2L, 
2L, 1L, 2L), rs2 = c(2L, 1L, 2L, 0L, 1L, 1L), rs3 = c(1L, 0L, 
1L, 2L, 2L, 2L), rs4 = c(1L, 1L, 1L, 1L, 0L, 2L), rs5 = c(1L, 
0L, 0L, 0L, 1L, 2L), rs6 = c(1L, 1L, 1L, 1L, 1L, 2L), rs7 = c(0L, 
0L, 0L, 0L, 0L, 0L), rs8 = c(0L, 0L, 1L, 0L, 2L, 1L), rs9 = c(0L, 
0L, 2L, 1L, 1L, 0L), rs10 = c(2L, 0L, 0L, 2L, 2L, 1L), rs11 = c(0L, 
1L, 1L, 0L, 1L, 1L), rs12 = c(1L, 2L, 0L, 1L, 2L, 2L), rs13 = c(0L, 
2L, 0L, 0L, 0L, 0L), rs14 = c(1L, 1L, 1L, 1L, 2L, 2L), rs15 = c(1L, 
2L, 1L, 1L, 0L, 1L), rs16 = c(0L, 2L, 1L, 2L, 2L, 1L), rs17 = c(0L, 
2L, 1L, 1L, 2L, 2L), rs18 = c(1L, 2L, 2L, 1L, 1L, 1L), rs19 = c(1L, 
1L, 0L, 1L, 2L, 2L), rs20 = c(2L, 1L, 0L, 2L, 2L, 1L), rs21 = c(1L, 
2L, 2L, 1L, 1L, 0L), rs22 = c(1L, 1L, 2L, 2L, 0L, 1L), rs23 = c(2L, 
0L, 2L, 1L, 1L, 1L), rs24 = c(0L, 0L, 0L, 2L, 2L, 2L), rs25 = c(2L, 
2L, 1L, 1L, 0L, 1L), rs26 = c(1L, 1L, 0L, 2L, 0L, 1L), rs27 = c(1L, 
1L, 1L, 1L, 0L, 1L), rs28 = c(0L, 1L, 1L, 2L, 0L, 2L), rs29 = c(2L, 
2L, 2L, 2L, 1L, 2L), rs30 = c(0L, 2L, 1L, 2L, 1L, 0L)), row.names = c(NA, 
6L), class = "data.frame")```
Tabbi
  • 69
  • 1
  • 9
  • there are lots of answered questions of this general type on SO already - most are about linear regression but the general method is the same, something like this: https://stackoverflow.com/questions/28373227/r-regressions-in-a-loop – Ben Bolker Jun 25 '20 at 11:06
  • You can try the `logistic.display` function from the epiDisplay package. You can add all predictors in one model and the output shows crude ORs, CIs and p-values (together with the adjusted results) and p-values from the LR test. The function does the looping for you. – Edward Jun 25 '20 at 11:13

2 Answers2

2

Probably you want something like this:

lapply(1:30, function(i) glm(as.formula(paste0('casecontrol ~ ', 'rs', i)), data = mydata, family = binomial))

which will execute 30 logistic regressions with the selected predictor.

Instead of hard coding the overall number of predictors, you can use: sum(grepl('rs', names(mydata))), which will return 30.

You can use tidy function from broom package to get the summary in a tidy format.

purrr::map_dfr(1:30, function(i) data.frame(model = i, tidy(glm(as.formula(paste0('casecontrol ~ ', 'rs', i)), data = mydata, family = binomial))))

or you can do this in a more dynamic way:

names(mydata)[grepl('rs', names(mydata))] -> pred #get all predictors that contain 'rs'

purrr::map_dfr(1:length(pred), 
               function(i) data.frame(model = i, 
                                      tidy(glm(as.formula(paste0('casecontrol ~ ', pred[i])), data = mydata, family = binomial))))

If you want to include another variable, you simply need to adjust the pred vector.

c(pred, paste0(pred, ' + age')) -> pred #interaction between rs drivers and age

or

c(pred, paste0(pred, ' + age + sex')) -> pred #interaction between rs drivers age and sex
AlexB
  • 3,061
  • 2
  • 17
  • 19
  • Doing this returns results as follows for each SNP (or rs#s); `Call: glm(formula = as.formula(paste0("casecontrol ~ ", "rs", i)), family = binomial, data = mydata) Coefficients: (Intercept) rs1 -0.11215 -0.09693 Degrees of Freedom: 499 Total (i.e. Null); 498 Residual Null Deviance: 687.7 Residual Deviance: 687 AIC: 691` How do I get complete beta estimate, SE, p vlaues summary for each rs? – Tabbi Jun 25 '20 at 14:14
  • Thank you, looks great. Would the model work if my rs are not named from 1 to 30 and instead each rs has different name e.g., `rs1234_T rs4567_C rs1471633_C rs11264341_T rs2078267_T rs478607_A rs642803_T rs653178_T rs3741414_T rs1394125_A rs6598541_G rs7193778_T rs7188445_A rs7224610_A rs2079742_C rs164009_G rs17050272_A rs2307394_C` How do I insert such rs# in the model you suggested please? And where should I add factors to adjust for? like age + sex? – Tabbi Jun 25 '20 at 18:43
  • 1
    @Tabbi, check my edit. Please accept the answer if this is in line with your expectations. – AlexB Jun 25 '20 at 18:45
  • Thank you. Can you help me with the last bit about adding factors to adjust in the model e.g., age + sex and then I can get the adjusted results with the same command I believe?! – Tabbi Jun 25 '20 at 18:48
  • Thank you so very much. You are great!! – Tabbi Jun 25 '20 at 19:10
  • Considering I want to run regression in controls (casecontrol==0) males (SEX==1) only. Adding subsets in `lm` model as follows would work fine? My outcome this time is continuous. `purrr::map_dfr(1:length(pred), function(i) data.frame(model = i, tidy(lm(as.formula(paste0('outcome ~ ', pred[i])), subset=casecontrol==0 & SEX==1, data = mydata))))` – Tabbi Aug 18 '20 at 14:33
  • Can you please see my comment above? Thank you – Tabbi Aug 19 '20 at 10:45
0

you can do something like this

outcome<-mydata %>% select("casecontrol")  #select outcome

features <- mydata %>% select("rs1":"rs30") #select features

features_names<-data.frame(colnames(features)) #get feature names

for(i in 1:nrow(features_names))     # loop over length of features_name
{selected=features[,i,drop=FALSE]    #get each feature 
total <- cbind(selected, response)   # combine with outcome
glm.fit <- glm(casecontrol ~ ., data = total, family = binomial("logit")) 
summary(glm.fit)
}