I am using lm()
on a training set of data that includes a polynomial. When I subset in advance with [ ]
I get different coefficients compared to using the subset
argument in the lm()
function call. Why?
library(ISLR2)
set.seed (1)
train <- sample(392, 196)
auto_train <- Auto[train,]
lm.fit.data <- lm(mpg ~ poly(horsepower, 2), data = auto_train)
summary(lm.fit.data)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = auto_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8711 -2.6655 -0.0096 2.0806 16.1063
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.8745 0.3171 75.298 < 2e-16 ***
#> poly(horsepower, 2)1 -89.3337 4.4389 -20.125 < 2e-16 ***
#> poly(horsepower, 2)2 33.2985 4.4389 7.501 2.25e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared: 0.705, Adjusted R-squared: 0.702
#> F-statistic: 230.6 on 2 and 193 DF, p-value: < 2.2e-16
lm.fit.subset <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
summary(lm.fit.subset)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto, subset = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8711 -2.6655 -0.0096 2.0806 16.1063
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.5496 0.3175 74.182 < 2e-16 ***
#> poly(horsepower, 2)1 -123.5881 6.4587 -19.135 < 2e-16 ***
#> poly(horsepower, 2)2 47.7189 6.3613 7.501 2.25e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared: 0.705, Adjusted R-squared: 0.702
#> F-statistic: 230.6 on 2 and 193 DF, p-value: < 2.2e-16
Created on 2021-12-26 by the reprex package (v2.0.1)