> but why don't they give the same fitted values and adjusted R²?
Fitted values won't necessarily be the same if you don't use all columns from `poly`.
```r
set.seed(0)
y <- runif(100)
x <- runif(100)
X <- poly(x, 3)

all.equal(lm(y ~ X)$fitted, lm(y ~ x + I(x ^ 2) + I(x ^ 3))$fitted)
#[1] TRUE

all.equal(lm(y ~ X[, 1:2])$fitted, lm(y ~ x + I(x ^ 2))$fitted)
#[1] TRUE

all.equal(lm(y ~ X - 1)$fitted, lm(y ~ x + I(x ^ 2) + I(x ^ 3) - 1)$fitted)  ## no intercept
#[1] "Mean relative difference: 33.023"

all.equal(lm(y ~ X[, c(1, 3)])$fitted, lm(y ~ x + I(x ^ 3))$fitted)
#[1] "Mean relative difference: 0.03008166"

all.equal(lm(y ~ X[, c(2, 3)])$fitted, lm(y ~ I(x ^ 2) + I(x ^ 3))$fitted)
#[1] "Mean relative difference: 0.03297488"
```
We only have `~ 1 + poly(x, degree)[, 1:k]` equivalent to `~ 1 + x + I(x ^ 2) + ... + I(x ^ k)`, for any `k <= degree`. (I explicitly write out the intercept, to emphasize that we have to start from the polynomial of degree 0.)
(The reason is related to how an orthogonal polynomial is generated. See How `poly()` generates orthogonal polynomials? How to understand the "coefs" returned? for great detail. Note that when doing a QR factorization `X = QR`, since `R` is an upper triangular matrix (not a diagonal matrix), `Q[, ind]` will not have the same column space as `X[, ind]` for an arbitrary subset `ind`, unless `ind = 1:k`.)
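To see this concretely, here is a minimal sketch of the QR argument, reusing the `x` and `y` simulated above. (It mimics the linear algebra only, not `poly`'s exact internal computation, which also centers and normalizes.)

```r
## QR-factorize the raw cubic basis: Q is an orthogonal basis,
## playing the role of poly(x, 3); R is upper triangular
Xraw <- outer(x, 1:3, "^")   ## columns: x, x^2, x^3
Q <- qr.Q(qr(Xraw))

## leading columns: Q[, 1:2] and Xraw[, 1:2] span the same space
all.equal(lm(y ~ Q[, 1:2])$fitted, lm(y ~ Xraw[, 1:2])$fitted)
#[1] TRUE

## arbitrary subset: column spaces differ, so the fits differ
all.equal(lm(y ~ Q[, c(1, 3)])$fitted, lm(y ~ Xraw[, c(1, 3)])$fitted)
## reports a "Mean relative difference", not TRUE
```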
So `I(x ^ 2)` is not equivalent to `poly(x, 2)[, 2]`, and hence you will get different fitted values and a different (adjusted) R².
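A direct check of this claim, again with the simulated `x` and `y`:

```r
all.equal(lm(y ~ poly(x, 2)[, 2])$fitted, lm(y ~ I(x ^ 2))$fitted)
## reports a "Mean relative difference", not TRUE
```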
> is it valid to use only the second term of the orthogonal polynomial and specify the model as follows?
It is really a bad idea for `leaps` (or, generally, any model-selection procedure) to drop columns from an orthogonal polynomial. An orthogonal polynomial is a factor-like term, whose significance is determined by an F-statistic (i.e., treating all columns as a whole), rather than by t-statistics for individual columns.
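For example, a hedged sketch of how to judge the term as a whole, via an F-test on nested models rather than the per-column t-tests of `summary()`:

```r
## compare nested models with anova(): the F-test treats both
## columns of poly(x, 2) jointly
fit0 <- lm(y ~ 1)
fit2 <- lm(y ~ poly(x, 2))
anova(fit0, fit2)
```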
In fact, even for raw polynomials, it is not a good idea to omit any low-order term. For example, `y ~ 1 + I(x ^ 2)`, which omits the linear term, is problematic: the fit is not invariant to a linear shift. For example, if we shift `x` to get `x1`:
```r
shift <- runif(1)  ## an arbitrary value; can be `mean(x)`
x1 <- x - shift
```
then `y ~ 1 + I(x ^ 2)` is not equivalent to `y ~ 1 + I(x1 ^ 2)`, but `y ~ 1 + x + I(x ^ 2)` is still equivalent to `y ~ 1 + x1 + I(x1 ^ 2)`.
```r
all.equal(lm(y ~ 1 + I(x ^ 2))$fitted, lm(y ~ 1 + I(x1 ^ 2))$fitted)
#[1] "Mean relative difference: 0.02020984"

all.equal(lm(y ~ 1 + x + I(x ^ 2))$fitted, lm(y ~ 1 + x1 + I(x1 ^ 2))$fitted)
#[1] TRUE
```
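The full orthogonal polynomial is shift-invariant too, since `poly(x, 2)` and `poly(x1, 2)` (each together with the intercept) span the same degree-2 polynomial space; a quick check with the variables above:

```r
all.equal(lm(y ~ poly(x, 2))$fitted, lm(y ~ poly(x1, 2))$fitted)
#[1] TRUE
```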
I briefly mentioned the issue of dropping columns at R: How to or should I drop an insignificant orthogonal polynomial basis in a linear model?, but my examples here give you more insight.
> Is there a more direct way to retrieve a single model from `regsubsets` output than specifying the model by hand?
I don't know; at least I did not figure it out almost 2 years ago when answering this thread: Get all models from leaps regsubsets.
> One remaining question though. Assuming that `leaps` returns `poly(X, 2)2`, I should definitely retain `poly(X, 2)1` in my model. But what if only `poly(X, 2)1` is returned by `leaps`? Can the higher order term be dropped then?
There is no problem dropping higher order terms (in this case, where you originally fitted a quadratic polynomial). As I said, we have the equivalence for `ind = 1:j`, where `j <= degree`. But make sure you understand this. Take the following two examples; a quick numerical check of the first follows them.
- If `leaps` drops `poly(x, 5)3` and `poly(x, 5)5`, you can safely remove `poly(x, 5)5` but are still advised to retain `poly(x, 5)3`. That is, instead of fitting a 5th-order polynomial, you fit a 4th-order one.
- If `leaps` drops `poly(x, 6)3` and `poly(x, 6)5`, then since `poly(x, 6)6` is not dropped, you are advised to drop no terms at all.
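For the first case, reusing the `x` and `y` simulated earlier: the first four columns of `poly(x, 5)` span the same space as `poly(x, 4)`, so keeping only the leading columns really is the same as fitting a lower-order polynomial.

```r
all.equal(lm(y ~ poly(x, 5)[, 1:4])$fitted, lm(y ~ poly(x, 4))$fitted)
#[1] TRUE
```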