I need to build a formula for a linear regression model (using the glm() function), but I have too many variables to type out by hand. I am doing gene expression analysis, so what I'm looking for is a way to concatenate all those variables (the column names of my data.frame) into a single string from which the formula can be built.

My data looks something like this (the actual data frame has 213 columns):

> df
         Smoke    PRR22 C15orf40     RAX2   GIMAP1    TM2D3 FAM167AAS1 LINC00161    SMCR8  CYP11B1
DP019     No 6.247058 4.609030 4.920439 3.531275 6.032196   1.576602  3.261709 5.752494 4.082924
DP021    Yes 5.767487 4.451362 4.834086 3.054192 6.049870   1.779412  2.618781 5.291328 4.274439
DP022     No 6.008855 4.841719 4.834774 3.354556 6.244215   1.580933  3.135989 4.989184 3.319836
DP025    Yes 5.390064 4.420183 4.923600 3.356938 5.516580   1.796413  2.984576 5.189582 3.833807
DP033     No 5.479384 5.987276 4.858381 3.454082 7.176767   1.640109  3.213976 5.378756 4.195856
DP035     No 5.439995 4.825332 5.469710 3.561561 6.357713   1.684058  3.635607 4.786237 3.792060

The first column ("Smoke") is my trait variable, and the remaining columns (named by gene) hold the gene expression levels.

I would like to build something like this:

glm(Smoke ~ PRR22 + C15orf40 + RAX2 + GIMAP1... and so forth

My question is: how can I automate it in a way I have all my variables there?

Maybe concatenating the column names into one string would solve the problem? For example:

for (i in colnames(df)[-1]){
    form <- as.formula(paste0("Smoke ~ ", i))
    glm(form, data=df)
    }

But it is not working. I am sure I am missing something... or a lot. So, if anyone could help, that would be excellent!

Douglas
  • Maybe I got it wrong but aren't you looking for `~.`, no? – NelsonGon Apr 10 '19 at 08:50
  • Could you give a more concrete example? – Douglas Apr 10 '19 at 08:52
  • You're predicting Smoke based on all other predictor variables?! Yes? – NelsonGon Apr 10 '19 at 08:54
  • Just something like: `df$Smoke<-as.factor(df$Smoke); glm(Smoke~.,df,family="binomial")`. – NelsonGon Apr 10 '19 at 08:55
  • If you really need to, maybe something like: `paste(setdiff(names(df), "Smoke"), collapse = " + ")`, although there are many other ways to automate the process. – NelsonGon Apr 10 '19 at 09:00
  • You're missing a parenthesis in `form <- as.formula(paste0("Smoke ~ ", i))`. Also note that that line of the loop is replacing the content of `form` with each iteration. @NelsonGon gave you an answer. – Pablo Rod Apr 10 '19 at 09:02
  • Oh, sorry, NelsonGon. I didn't get it at first. Yes, it is "smoke" against all other predictors. Thanks a lot! I will try your line. – Douglas Apr 10 '19 at 09:06
  • Take a look at this for your reference: https://stackoverflow.com/questions/13446256/meaning-of-tilde-dot-argument – NelsonGon Apr 10 '19 at 09:09
  • I urge everybody to avoid the temptation of using string concatenation to build formulas. R has a rich system for working directly on unevaluated expressions; taking the detour via strings is completely unnecessary and conceptually icky. To start with, take a look at the `bquote` and `substitute` functions. – Konrad Rudolph Apr 10 '19 at 09:24
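
Pulling the comments above together, here is a minimal sketch of the approaches they mention: the `~ .` shorthand, building the formula from the column names, and building it from unevaluated expressions with `bquote`. The small `df` below is made-up data standing in for the real 213-column data frame, and `family = binomial` is an assumption that follows NelsonGon's suggestion for a Yes/No trait. `reformulate()` is a base R helper that was not mentioned in the thread; it is used here in place of hand-written `paste()` calls.

```r
# Made-up stand-in for the real data (assumption: Smoke is a two-level factor).
set.seed(1)
df <- data.frame(
  Smoke    = factor(rep(c("No", "Yes"), times = 10)),
  PRR22    = rnorm(20, mean = 6),
  C15orf40 = rnorm(20, mean = 4.5),
  RAX2     = rnorm(20, mean = 5)
)

# Every column except the response.
predictors <- setdiff(names(df), "Smoke")

# 1. Dot shorthand: "." expands to all columns of `data` other than the response.
fit_dot <- glm(Smoke ~ ., data = df, family = binomial)

# 2. reformulate() builds the formula from a character vector of term labels.
form <- reformulate(predictors, response = "Smoke")
fit_ref <- glm(form, data = df, family = binomial)

# 3. No string pasting: turn the names into symbols, fold them together
#    with `+`, and splice the result into the formula with bquote().
rhs <- Reduce(function(a, b) call("+", a, b), lapply(predictors, as.name))
form_expr <- eval(bquote(Smoke ~ .(rhs)))
fit_expr <- glm(form_expr, data = df, family = binomial)

# 4. If one model per gene is wanted (as in the loop in the question),
#    keep each fit in a named list instead of discarding it.
fits <- lapply(predictors, function(gene) {
  glm(reformulate(gene, response = "Smoke"), data = df, family = binomial)
})
names(fits) <- predictors
```

Approaches 1 to 3 fit the same model, so `coef(fit_dot)`, `coef(fit_ref)` and `coef(fit_expr)` should agree. The loop in the question differs in two ways: it fits a separate model for each gene, and because the result of `glm()` is neither stored nor printed inside the `for` loop, nothing is visible afterwards.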

0 Answers