I am having trouble building an lm call from many independent variables that are created in a for loop. Each iteration creates one of 14 independent variables (x1, x2, x3, ..., x14), and the variable names are saved as strings in the vector 'independent_variables'. For the dependent variable y1, I would like to fit lm(y1 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14).

I've tried pasting the elements of the vector together and passing the result to lm, but it doesn't seem to be recognized as a formula.

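    independent_variables <- character(0)  # start with an empty vector of names
    for (j in seq_along(num)) {
      nam <- paste0("x", j)                # "x1", "x2", ...
      assign(nam, vec)                     # vec: the variable created in this iteration
      independent_variables <- c(independent_variables, nam)
    }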

> independent_variables
 [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10" "x11" "x12" "x13" "x14"

These are the independent variables of the linear regression, and each one is a matrix with 318 rows and 1 column. The dependent variable y1 is a matrix with the same dimensions.

> x1
                                                    COAD_65
ACCx_025FE5F8_885E_433D_9018_7AE322A92285_X034_S09 -0.368827920
ACCx_2A5AE757_20D5_49B6_95FF_CAE08E8197A0_X012_S05 -0.418133754
ACCx_3D0CD3BD_3960_46FB_92C3_777F11CCD0FC_X011_S06 -0.885246719
ACCx_4D0D43F5_D8F0_4735_92D5_F40E321C7A05_X010_S09 -0.908954868
ACCx_81A262BD_3078_4BDB_8EB1_30DD6D7948C3_X027_S03 -0.284544506
ACCx_B6E6F014_A599_4A58_A7A5_1F748471D662_X013_S12 -0.991800815
ACCx_B901534B_5E93_475A_91E7_B2DB7DFE98A5_X020_S02 -0.538162178
ACCx_EDEB779F_A603_479D_AAFE_428BC7E4B8DB_X038_S03 -0.462774125
...

UCEC_BDFE8123_081E_49AF_930B_2371D8DEC261_X030_S01 -1.032249118
UCEC_C335297F_2D63_4973_9182_FA18C28E001E_X037_S04 -0.550676273
UCEC_D820B024_6B3B_4B5B_866E_F9A8139C270B_X039_S09 -0.036913872


> y1
 TCGA-OR-A5K8-01A TCGA-PK-A5H8-01A TCGA-OR-A5J3-01A TCGA-OR-A5J6-01A TCGA-OR-A5KX-01A TCGA-OR-A5J2-01A
      0.000000000      0.000000000      0.013752216      0.000000000      0.000000000      0.000000000
 TCGA-OR-A5J9-01A TCGA-OR-A5JZ-01A TCGA-PA-A5YG-01A TCGA-CU-A3YL-01A TCGA-GD-A3OQ-01A TCGA-CF-A3MI-01A
      0.009707204      0.000000000      0.000000000      0.000000000      0.000000000      0.119174367
 ...
 TCGA-BL-A13J-01A TCGA-GV-A3JW-01A TCGA-DK-A1AD-01A TCGA-FD-A3SR-01A TCGA-CF-A1HR-01A TCGA-BL-A3JM-01A
      0.019066953      0.355925504      0.019473742      0.062201816      0.081559894      0.243386421

Once the lm call is built correctly, the result should look like this, for example:

> Call:
> lm(formula = y1 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
>     x10 + x11 + x12 + x13 + x14)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -0.6282 -0.1130 -0.0257  0.0491  6.0798 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  0.054546   0.040219   1.356   0.1759    
> x1           0.145644   0.035340   4.121 4.66e-05 ***
> x2           0.005909   0.038020   0.155   0.8766    
> x3          -0.085892   0.051854  -1.656   0.0985 .  
> x4           0.032686   0.029443   1.110   0.2677    
> x5          -0.047268   0.033388  -1.416   0.1577    
> x6           0.026735   0.032327   0.827   0.4088    
> x7           0.035673   0.051047   0.699   0.4851    
> x8           0.037374   0.060258   0.620   0.5355    
> x9           0.024493   0.053045   0.462   0.6445    
> x10          0.006623   0.059025   0.112   0.9107    
> x11         -0.017017   0.034501  -0.493   0.6221    
> x12          0.032184   0.046235   0.696   0.4868    
> x13          0.009988   0.033298   0.300   0.7644    
> x14         -0.017836   0.024505  -0.728   0.4672    
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 0.3936 on 366 degrees of freedom
> Multiple R-squared:  0.2768,   Adjusted R-squared:  0.2492
> F-statistic: 10.01 on 14 and 366 DF,  p-value: < 2.2e-16
MatthewR
na.0o0
    A few more good ideas here - https://stackoverflow.com/questions/5251507/how-to-succinctly-write-a-formula-with-many-variables-from-a-data-frame . Especially the reformulate function (ex `reformulate(names(anscombe), "y1")`) – M.Viking Jun 14 '19 at 14:13
    @M.Viking Thanks. good information. It helped me a lot! – na.0o0 Jun 17 '19 at 05:51

1 Answer

Using the built-in anscombe data frame:

names(anscombe)
## [1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"
indep_names <- grep("x", names(anscombe), value = TRUE)

lm(anscombe[c("y1", indep_names)])

or

lm(y1 ~., anscombe[c("y1", indep_names)])

or

fo <- sprintf("y1 ~ %s", paste(indep_names, collapse = "+"))
do.call("lm", c(fo, quote(anscombe)))

or

fo <- reformulate(indep_names, response = "y1")
do.call("lm", c(fo, quote(anscombe)))

In the last two cases we could write `lm(fo, anscombe)` instead if we don't mind that the formula then shows up as just `fo` in the output.
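As a rough sketch of that difference, the following compares the two styles on the built-in anscombe data. The names `fit_direct` and `fit_docall` are made up for illustration, and `list()` is used here in place of `c()` to assemble the argument list; the fitted coefficients should be the same either way, and only the recorded call differs.

```r
indep_names <- grep("x", names(anscombe), value = TRUE)
fo <- reformulate(indep_names, response = "y1")

# Passing the formula object directly: the stored call refers to `fo` by name.
fit_direct <- lm(fo, anscombe)

# Building the call with do.call: the formula itself is embedded in the call.
# (list() is an illustrative substitute for c() when assembling the arguments.)
fit_docall <- do.call("lm", list(fo, quote(anscombe)))

# Same coefficients; only the formula recorded in the call differs.
all.equal(coef(fit_direct), coef(fit_docall))
fit_direct$call$formula   # the symbol fo
fit_docall$call$formula   # the expanded formula
```

Printing or summarizing `fit_docall` therefore shows the full formula in the `Call:` line, which is what the question's desired output looks like.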

G. Grothendieck
  • I've tried the third one and it worked. Thanks! It seems that lm(fo) gives the same result. Is 'lm(fo)' also a valid approach? – na.0o0 Jun 14 '19 at 12:51
  • lm(fo) will work if the variables are not in a data.frame; however, note that the output would show the formula as `fo` rather than showing the content of `fo` (as mentioned already in the answer). – G. Grothendieck Jun 14 '19 at 12:57
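To illustrate the case discussed in these comments, here is a minimal sketch with free-standing variables created by `assign()` in a loop, as in the question. The data are random placeholders, plain numeric vectors are assumed instead of one-column matrices, and the names `n`, `fit_env`, `dat`, and `fit_df` are hypothetical.

```r
set.seed(1)                      # placeholder data, for illustration only
n <- 20
y1 <- rnorm(n)

independent_variables <- character(0)
for (j in 1:3) {
  nam <- paste0("x", j)
  assign(nam, rnorm(n))          # creates x1, x2, x3 in the current environment
  independent_variables <- c(independent_variables, nam)
}

fo <- reformulate(independent_variables, response = "y1")

# No data argument: lm looks the variables up in the formula's environment.
fit_env <- lm(fo)

# Alternative: gather the variables into a data frame first, then embed the
# formula in the call so that it is displayed in full.
dat <- as.data.frame(mget(c("y1", independent_variables)))
fit_df <- do.call("lm", list(fo, quote(dat)))

all.equal(coef(fit_env), coef(fit_df))
```

Collecting the variables into a data frame is optional here, but it keeps the predictors together and makes later calls such as `predict()` on new data easier to set up.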