3

I am trying to create a script to optimize a linear regression analysis, and I would really like to operate on the model output, most specifically the Pr(>|t|) value. Unfortunately, I do not know how to get the model output into a matrix or data table.

Here is an example: in the code below, I create seven columns of data and fit the seventh using the other six. When I get a summary of the model, it is clear that three of the parameters are much more significant than the other three. If I had access to the coefficient output numerically, I could perhaps write a script to drop the least significant parameter and re-run the analysis... however, as it is, I am doing this manually.

What is the best way to do this?

q = matrix( 
c(2,14,-4,1,10,9,41,8,13,2,0,20,3,27,1,10,-1,0,
10,-6,23,6,13,-8,1,15,-7,55,7,14,10,0,20,-3,6,4,20,
-1,5,19,-2,48,10,19,8,8,10,-2,24,8,13,9,8,14,5,7,7,
12,1,0,16,7,27,7,10,-1,1,15,7,31,2,20,-5,10,12,3,57,
0,19,-8,8,11,-4,63,5,11,7,8,10,-7,6,9,10,-7,2,19,8,
51,2,18,3,3,14,4,30), nrow=15, ncol=7, byrow = TRUE)
#
colnames(q) <- c("A","B","C","D","E","F","Z")
#
q <- as.data.frame(q)
#
qmodel <- lm(Z~.,data=q)
#
summary(qmodel)
#

Output:

Call:
lm(formula = Z ~ ., data = q)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.25098 -0.52655 -0.02931  0.62350  1.26649 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.09303    1.51627  -1.380    0.205    
A            0.91161    0.11719   7.779 5.34e-05 ***
B            1.99503    0.09539  20.914 2.87e-08 ***
C           -2.98252    0.04789 -62.283 4.91e-12 ***
D            0.13458    0.10377   1.297    0.231    
E            0.15191    0.09397   1.617    0.145    
F            0.01417    0.04716   0.300    0.772    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9439 on 8 degrees of freedom
Multiple R-squared:  0.9986,    Adjusted R-squared:  0.9975 
F-statistic: 928.9 on 6 and 8 DF,  p-value: 6.317e-11

Now here is what I'd like to see:

 > coeffs
             Estimate Std. Error t value Pr(>|t|)
 (Intercept) -2.09303    1.51627  -1.380 2.05e-01
 A            0.91161    0.11719   7.779 5.34e-05
 B            1.99503    0.09539  20.914 2.87e-08
 C           -2.98252    0.04789 -62.283 4.91e-12
 D            0.13458    0.10377   1.297 2.31e-01
 E            0.15191    0.09397   1.617 1.45e-01
 F            0.01417    0.04716   0.300 7.72e-01

As it is, I got that in this manner... not automated at all...

coeffs = matrix(
c(-2.09303,1.51627,-1.38,0.205,0.91161,0.11719,
7.779,0.0000534,1.99503,0.09539,20.914,0.0000000287,
-2.98252,0.04789,-62.283,0.00000000000491,0.13458,
0.10377,1.297,0.231,0.15191,0.09397,1.617,0.145,
0.01417,0.04716,0.3,0.772), nrow=7, ncol=4, byrow = TRUE)
#
rownames(coeffs) <- c("(Intercept)","A","B","C","D","E","F")
colnames(coeffs) <- c("Estimate","Std. Error","t value","Pr(>|t|)")
#
coeffs <- as.data.frame(coeffs)
#
coeffs
halfer
rucker

2 Answers

8

What you want is the coefficients component of the summary object.

m <- lm(Z~.,data=q)

summary(m)$coefficients
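As a minimal sketch of working with that matrix numerically, the snippet below rebuilds the data from the question, pulls out the Pr(>|t|) column by name, and finds the least significant predictor. The names pvals and least are made up for illustration.

```r
# Rebuild the example data from the question
q <- as.data.frame(matrix(
  c(2,14,-4,1,10,9,41,8,13,2,0,20,3,27,1,10,-1,0,
    10,-6,23,6,13,-8,1,15,-7,55,7,14,10,0,20,-3,6,4,20,
    -1,5,19,-2,48,10,19,8,8,10,-2,24,8,13,9,8,14,5,7,7,
    12,1,0,16,7,27,7,10,-1,1,15,7,31,2,20,-5,10,12,3,57,
    0,19,-8,8,11,-4,63,5,11,7,8,10,-7,6,9,10,-7,2,19,8,
    51,2,18,3,3,14,4,30),
  nrow = 15, ncol = 7, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D", "E", "F", "Z"))))

m <- lm(Z ~ ., data = q)

# The coefficient table is a plain numeric matrix with row and column names
coeffs <- summary(m)$coefficients

# Named numeric vector of p-values, one entry per term
pvals <- coeffs[, "Pr(>|t|)"]

# Least significant predictor, ignoring the intercept (first element)
least <- names(which.max(pvals[-1]))
print(least)
```

For the summary printed in the question, this should pick F, the row with the largest p-value. If a data frame is preferred over a matrix, as.data.frame(coeffs) keeps the same row and column names.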

Some further comments:

  • Use step to do stepwise variable selection rather than coding it yourself;
  • Stepwise variable selection has bad statistical properties; consider something like glmnet (in the package of the same name) to do regularized model building instead.
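A hedged sketch of the first suggestion, using step from the stats package on the data from the question. Note that step compares models by AIC rather than by Pr(>|t|), so it is not the same criterion as dropping the least significant term by hand. The name reduced is illustrative only.

```r
# Rebuild the example data from the question
q <- as.data.frame(matrix(
  c(2,14,-4,1,10,9,41,8,13,2,0,20,3,27,1,10,-1,0,
    10,-6,23,6,13,-8,1,15,-7,55,7,14,10,0,20,-3,6,4,20,
    -1,5,19,-2,48,10,19,8,8,10,-2,24,8,13,9,8,14,5,7,7,
    12,1,0,16,7,27,7,10,-1,1,15,7,31,2,20,-5,10,12,3,57,
    0,19,-8,8,11,-4,63,5,11,7,8,10,-7,6,9,10,-7,2,19,8,
    51,2,18,3,3,14,4,30),
  nrow = 15, ncol = 7, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D", "E", "F", "Z"))))

m <- lm(Z ~ ., data = q)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(m, direction = "backward", trace = 0)

# Inspect which terms survived and their coefficient table
print(formula(reduced))
print(summary(reduced)$coefficients)
```

The caveat in the second bullet still applies: the p-values reported for the reduced model do not account for the selection step.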
Hong Ooi
  • Hong, this is brilliant! Exactly what I was looking for... although now I am wondering if my approach is flawed. I was not aware that stepwise variable selection has bad statistical properties. Can you elaborate a bit? What sort of errors am I likely to encounter? – rucker Aug 19 '14 at 00:17
  • Basically, stepwise methods are prone to overfit your data, meaning they'll mistake noise for signal. The problem is worst when you have small datasets and lots of variables, but you still need to be careful even with big datasets. For more info check out CrossValidated, the statistics/machine learning StackExchange. http://stats.stackexchange.com/questions/tagged/stepwise-regression – Hong Ooi Aug 19 '14 at 01:38
3

If I understand correctly, you need the matrix returned by the summary. That's pretty straightforward:

fit <- lm( formula, data=yourData)
coeffs <- summary(fit)$coefficients

After that, you can select the rows of coeffs that match your conditions, just as with any matrix. Example:

# Column 4 holds Pr(>|t|); keep the rows whose p-value is below 0.05
coeffs[coeffs[, 4] < 0.05, , drop = FALSE]
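Building on that, here is a rough sketch of the loop described in the question: repeatedly drop the predictor with the largest p-value and refit until every remaining predictor is below a cutoff. The cutoff alpha = 0.05 and the variable names are assumptions for illustration, and the overfitting caveats raised about stepwise selection apply to this loop as well.

```r
# Rebuild the example data from the question
q <- as.data.frame(matrix(
  c(2,14,-4,1,10,9,41,8,13,2,0,20,3,27,1,10,-1,0,
    10,-6,23,6,13,-8,1,15,-7,55,7,14,10,0,20,-3,6,4,20,
    -1,5,19,-2,48,10,19,8,8,10,-2,24,8,13,9,8,14,5,7,7,
    12,1,0,16,7,27,7,10,-1,1,15,7,31,2,20,-5,10,12,3,57,
    0,19,-8,8,11,-4,63,5,11,7,8,10,-7,6,9,10,-7,2,19,8,
    51,2,18,3,3,14,4,30),
  nrow = 15, ncol = 7, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D", "E", "F", "Z"))))

alpha <- 0.05              # assumed significance cutoff
fit <- lm(Z ~ ., data = q)

repeat {
  coeffs <- summary(fit)$coefficients
  # p-values of the predictors only; drop = FALSE keeps the row names
  p <- coeffs[rownames(coeffs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
  # Stop when no predictors are left or all of them pass the cutoff
  if (nrow(p) == 0 || max(p) <= alpha) break
  worst <- rownames(p)[which.max(p)]
  # Refit without the least significant predictor
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

print(summary(fit)$coefficients)
```

The paste-based formula assumes syntactic column names such as A to F; names containing spaces or other special characters would need backticks.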
Barranka