Scaling independent variables while predicting using linear regression model

Question

I am trying to fit a linear model where Y is the dependent variable and X1, X2, X3 are my independent variables.

I have scaled my inputs using the 'scale' function in R and obtained the coefficients and intercept.

Y = a1X1 + a2X2 + a3X3 + c

Now, to predict Y for a given value of (X1, X2, X3), is it ok to directly compute the value of Y using the above equation, or should the input variables be scaled before putting them in the equation? If yes, how can we scale them?

score 5 · Answer 1 · answered Jul 09 '14 at 19:53

If you have a training set (the original data) and a test set (the new data), and you build a model using the training set scaled to [0,1], then you have to scale the test set as well before making predictions with that model. But be careful: you have to scale the test set using the same parameters as the training set. So if you use (x-min(x))/(max(x)-min(x)) to scale, you must use the values of max(x) and min(x) from the training dataset. Here's an example:

set.seed(1)      # for reproducible example
train <- data.frame(X1=sample(1:100,100),
                 X2=1e6*sample(1:100,100),
                 X3=1e-6*sample(1:100,100))
train$y <- with(train,2*X1 + 3*1e-6*X2 - 5*1e6*X3 + 1 + rnorm(100,sd=10))

fit  <- lm(y~X1+X2+X3,train)
summary(fit)
# ...
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.063e+00  3.221e+00    0.33    0.742    
# X1           2.017e+00  3.698e-02   54.55   <2e-16 ***
# X2           2.974e-06  3.694e-08   80.51   <2e-16 ***
# X3          -4.988e+06  3.715e+04 -134.28   <2e-16 ***
# ---

# scale the predictor variables to [0,1]
mins   <- sapply(train[,1:3],min)
ranges <- sapply(train[,1:3],function(x)diff(range(x)))
train.scaled <- as.data.frame(scale(train[,1:3],center=mins,scale=ranges))
train.scaled$y <- train$y
fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)
summary(fit.scaled)
# ...
# Coefficients:
#             Estimate Std. Error  t value Pr(>|t|)    
# (Intercept)    1.066      3.164    0.337    0.737    
# X1           199.731      3.661   54.553   <2e-16 ***
# X2           294.421      3.657   80.508   <2e-16 ***
# X3          -493.828      3.678 -134.275   <2e-16 ***
# ---

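Note that, as expected, scaling affects the values of the coefficients (of course...), but not the t-values of the slope coefficients, or the residual standard error of the fit, or R-squared, or the F statistic (I've only reproduced part of the summaries here). The intercept and its standard error do shift slightly, because min-max scaling also moves the origin.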

Now let's compare the effect of scaling with a test dataset.

# create test dataset
test <- data.frame(X1=sample(-5:5,10),
                      X2=1e6*sample(-5:5,10),
                      X3=1e-6*sample(-5:5,10))
# predict y based on test data with un-scaled fit
pred   <- predict(fit,newdata=test)

# scale the test data using min and range from training dataset
test.scaled <- as.data.frame(scale(test[,1:3],center=mins,scale=ranges))
# predict y based on new data scaled, with fit from scaled dataset
pred.scaled   <- predict(fit.scaled,newdata=test.scaled)

all.equal(pred,pred.scaled)
# [1] TRUE

So prediction using the un-scaled fit with un-scaled data yields the same result (up to numerical tolerance) as prediction using the scaled fit with scaled data.

score 2 · Answer 2 · Gregor Thomas · 2014-07-09T18:36:33.597

is it ok to directly compute value of Y using above equation or should the input variables be scaled before putting them in equation

The input variables should be scaled in the same way as you did your initial scaling.

If yes, how can we scale them ?

Read the documentation for the command you used (?scale) and see what it did! Then replicate it for your new prediction data. If you used the defaults, it subtracted the means of your original predictors, then divided by their standard deviations. You should go back to the raw data, calculate the means and standard deviations, and use those to scale your data for prediction in the same way.
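
A minimal sketch of that approach, assuming the defaults of scale() were used. The data frames `train` and `new` and their values are made up for illustration. Rather than recomputing the means and standard deviations by hand, this reads them from the "scaled:center" and "scaled:scale" attributes that scale() attaches to its result.

```r
set.seed(42)  # illustrative data only, not the asker's data
train <- data.frame(X1 = rnorm(50, 10, 2),
                    X2 = rnorm(50, 1000, 300),
                    X3 = rnorm(50, 0.5, 0.1))
train$Y <- with(train, 2 * X1 + 0.01 * X2 - 4 * X3 + rnorm(50))

# scale the predictors with the defaults (subtract mean, divide by sd)
X.scaled <- scale(train[, c("X1", "X2", "X3")])
centers  <- attr(X.scaled, "scaled:center")   # training means
sds      <- attr(X.scaled, "scaled:scale")    # training standard deviations

train.scaled   <- as.data.frame(X.scaled)
train.scaled$Y <- train$Y
fit.scaled <- lm(Y ~ X1 + X2 + X3, data = train.scaled)

# new observations, scaled with the TRAINING means and sds
new <- data.frame(X1 = c(9, 12), X2 = c(800, 1500), X3 = c(0.4, 0.7))
new.scaled  <- as.data.frame(scale(new, center = centers, scale = sds))
pred.scaled <- predict(fit.scaled, newdata = new.scaled)

# for comparison: a model fitted on the raw predictors
fit.raw  <- lm(Y ~ X1 + X2 + X3, data = train)
pred.raw <- predict(fit.raw, newdata = new)
all.equal(pred.scaled, pred.raw)
```

The final all.equal() call should report that the two sets of predictions agree, since a linear model is unaffected by shifting and rescaling its predictors as long as the same transformation is applied at prediction time.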

Transforming fitted coefficients

Your other option is to transform the coefficients. This just takes a little bit of algebra. If your scaling transformation is f(x) = mx + b, and your fitted model is y = a * f(x) + c, it's easy to see that

y = a * f(x) + c
y = a * (mx + b) + c
y = a m x + a b + c

So, with untransformed data x your slope is a * m and your intercept is a * b + c. This is easily extended to more variables or a different transformation. If you're transforming to [0, 1], your transformation is probably f(x) = (x - min(x)) / (max(x) - min(x))... the algebra shouldn't be difficult, but I'll leave it to you.
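
For the [0, 1] case, the algebra can be sketched in R as follows. This is an illustration under assumptions: the data, the two predictors, and names such as `slopes.raw`, `intercept.raw` and `c0` are invented here. With f(x) = (x - min(x)) / (max(x) - min(x)), the constants are m = 1/range and b = -min/range.

```r
set.seed(42)  # illustrative data only
train <- data.frame(X1 = runif(50, 0, 100),
                    X2 = runif(50, 0, 1e6))
train$Y <- with(train, 2 * X1 + 3e-6 * X2 + 1 + rnorm(50))

mins   <- sapply(train[, c("X1", "X2")], min)
ranges <- sapply(train[, c("X1", "X2")], function(x) diff(range(x)))

train.scaled   <- as.data.frame(scale(train[, c("X1", "X2")],
                                      center = mins, scale = ranges))
train.scaled$Y <- train$Y
fit.scaled <- lm(Y ~ X1 + X2, data = train.scaled)

a  <- coef(fit.scaled)[c("X1", "X2")]    # slopes on the scaled predictors
c0 <- coef(fit.scaled)["(Intercept)"]    # intercept on the scaled predictors

# f(x) = m*x + b with m = 1/range and b = -min/range
m <- 1 / ranges
b <- -mins / ranges
slopes.raw    <- a * m                      # slope for raw x is a * m
intercept.raw <- unname(c0 + sum(a * b))    # intercept is c + sum of a * b

# predict directly from raw predictor values, no scaling step needed
new <- data.frame(X1 = c(10, 150), X2 = c(2e5, 9e5))
pred.manual <- as.numeric(intercept.raw + as.matrix(new) %*% slopes.raw)

# for comparison: coefficients of a model fitted on the raw data
fit.raw <- lm(Y ~ X1 + X2, data = train)
all.equal(unname(coef(fit.raw)), unname(c(intercept.raw, slopes.raw)))
```

The back-transformed coefficients should match those of a model fitted on the raw data, so either route gives the same predictions.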

edited Jul 09 '14 at 18:36

answered Jul 09 '14 at 17:04

Gregor Thomas


I am planning to use min-max normalization here, which means values will always lie in [0,1]. If I use the min and max from the raw data only, then there's a chance that a new value falls outside this min-max range. Will that have an impact? Or are you suggesting that I add the new value back to the raw data and rescale it? – Mohit Verma Jul 09 '14 at 17:08
All I'm saying is that, however you choose to scale your raw data, you need to apply the same transformation to the data you want to predict on. Adding your prediction data into the original data set to scale it, then refitting your model, would work. – Gregor Thomas Jul 09 '14 at 17:18
Your other option is to transform the fitted coefficients so that they can be applied to raw predictors. – Gregor Thomas Jul 09 '14 at 17:19
"transform the fitted coefficients" ..how is this achieved ? – Mohit Verma Jul 09 '14 at 17:24
see also http://stackoverflow.com/questions/24268031/unscale-and-uncenter-glmer-parameters/24286763#24286763 – Ben Bolker Jul 09 '14 at 20:03
@BenBolker Wow, a nice thorough answer there. – Gregor Thomas Jul 09 '14 at 20:20
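
On the question raised in the comments about new values falling outside the training min-max range, here is a small sketch with made-up, single-predictor data. It suggests that such values simply scale to numbers below 0 or above 1, and that the linear model's predictions still agree with the un-scaled fit. Whether extrapolating that far beyond the training data is sensible is a separate modelling question.

```r
set.seed(42)  # illustrative data only
train <- data.frame(X1 = runif(50, 0, 100))
train$Y <- 2 * train$X1 + 1 + rnorm(50)

# min-max parameters taken from the training data only
x.min   <- min(train$X1)
x.range <- diff(range(train$X1))

train.scaled <- data.frame(X1 = (train$X1 - x.min) / x.range, Y = train$Y)
fit.scaled <- lm(Y ~ X1, data = train.scaled)
fit.raw    <- lm(Y ~ X1, data = train)

# new values below, inside, and above the training range
new        <- data.frame(X1 = c(-20, 50, 130))
new.scaled <- data.frame(X1 = (new$X1 - x.min) / x.range)
range(new.scaled$X1)   # extends below 0 and above 1

all.equal(predict(fit.scaled, newdata = new.scaled),
          predict(fit.raw, newdata = new))
```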
