If you have a training set (the original data) and a test set (the new data), and you build a model using the training set scaled to [0,1], then you have to scale the test set as well before making predictions with this model. But be careful: you have to scale the test set using the same parameters as the training set. So if you use (x - min(x)) / (max(x) - min(x)) to scale, you must use the values of max(x) and min(x) from the training dataset, not from the test dataset. Here's an example:
set.seed(1) # for reproducible example
train <- data.frame(X1=sample(1:100,100),
X2=1e6*sample(1:100,100),
X3=1e-6*sample(1:100,100))
train$y <- with(train,2*X1 + 3*1e-6*X2 - 5*1e6*X3 + 1 + rnorm(100,sd=10))
fit <- lm(y~X1+X2+X3,train)
summary(fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.063e+00 3.221e+00 0.33 0.742
# X1 2.017e+00 3.698e-02 54.55 <2e-16 ***
# X2 2.974e-06 3.694e-08 80.51 <2e-16 ***
# X3 -4.988e+06 3.715e+04 -134.28 <2e-16 ***
# ---
# scale the predictor variables to [0,1]
mins <- sapply(train[,1:3],min)
ranges <- sapply(train[,1:3],function(x)diff(range(x)))
train.scaled <- as.data.frame(scale(train[,1:3],center=mins,scale=ranges))
train.scaled$y <- train$y
fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)
summary(fit.scaled)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.066 3.164 0.337 0.737
# X1 199.731 3.661 54.553 <2e-16 ***
# X2 294.421 3.657 80.508 <2e-16 ***
# X3 -493.828 3.678 -134.275 <2e-16 ***
# ---
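Note that, as expected, scaling changes the values of the coefficients (of course...), but not the t-values of the slopes, the residual standard error, R-squared, or the F-statistic (I've only reproduced part of the summaries here). The intercept and its t-value shift slightly because the predictors are also shifted by their minimums.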
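The relationship between the two sets of coefficients can be checked directly. The sketch below is self-contained: it builds a small data set of its own rather than reusing the objects above, and the names `d`, `mn`, `rng`, `f.raw` and `f.sc` are made up for the illustration. Assuming min-max scaling as described, each scaled slope should equal the unscaled slope times that predictor's range, and the scaled intercept should absorb the shift by the minimums.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
d <- data.frame(X1 = runif(50, 0, 100),
                X2 = runif(50, 0, 1e6))
d$y <- 2 * d$X1 + 3e-6 * d$X2 + 1 + rnorm(50, sd = 10)

# min-max scaling parameters, taken from the data the model is fit on
mn  <- sapply(d[, 1:2], min)
rng <- sapply(d[, 1:2], function(x) diff(range(x)))

d.sc   <- as.data.frame(scale(d[, 1:2], center = mn, scale = rng))
d.sc$y <- d$y

f.raw <- lm(y ~ X1 + X2, d)     # fit on raw predictors
f.sc  <- lm(y ~ X1 + X2, d.sc)  # fit on predictors scaled to [0,1]

# slopes: scaled slope = unscaled slope * range
slopes.ok <- all.equal(coef(f.sc)[-1], coef(f.raw)[-1] * rng)

# intercept: scaled intercept = unscaled intercept + sum(unscaled slope * min)
int.ok <- all.equal(unname(coef(f.sc)[1]),
                    unname(coef(f.raw)[1] + sum(coef(f.raw)[-1] * mn)))

# residual standard error is the same for both fits
sigma.ok <- all.equal(summary(f.raw)$sigma, summary(f.sc)$sigma)
```

All three checks compare with `all.equal`, so they test equality up to numerical tolerance rather than bit-for-bit identity.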
Now let's compare the effect of scaling with a test dataset.
# create test dataset
test <- data.frame(X1=sample(-5:5,10),
X2=1e6*sample(-5:5,10),
X3=1e-6*sample(-5:5,10))
# predict y based on test data with un-scaled fit
pred <- predict(fit,newdata=test)
# scale the test data using min and range from training dataset
test.scaled <- as.data.frame(scale(test[,1:3],center=mins,scale=ranges))
# predict y based on new data scaled, with fit from scaled dataset
pred.scaled <- predict(fit.scaled,newdata=test.scaled)
all.equal(pred,pred.scaled)
# [1] TRUE
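So prediction using the un-scaled fit with un-scaled data yields the same result (to within the numerical tolerance that all.equal checks) as prediction using the scaled fit with scaled data, provided the test data are scaled with the min and range from the training set.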
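To see why the training parameters matter, here is a minimal, self-contained sketch of what happens when the test set is scaled with its own min and range instead. The data and the names `tr`, `te`, `te.ok` and `te.bad` are invented for this illustration and are not part of the example above.

```r
set.seed(7)  # arbitrary seed, only for reproducibility of this sketch
tr <- data.frame(X1 = runif(100, 0, 100))
tr$y <- 2 * tr$X1 + 1 + rnorm(100, sd = 5)
te <- data.frame(X1 = c(10, 20, 30, 40))

# scaling parameters from the TRAINING data
tr.min <- min(tr$X1)
tr.rng <- diff(range(tr$X1))

tr.sc   <- data.frame(X1 = (tr$X1 - tr.min) / tr.rng, y = tr$y)
fit.raw <- lm(y ~ X1, tr)
fit.sc  <- lm(y ~ X1, tr.sc)

# correct: scale the test data with the training min and range
te.ok <- data.frame(X1 = (te$X1 - tr.min) / tr.rng)
# wrong: scale the test data with its own min and range
te.bad <- data.frame(X1 = (te$X1 - min(te$X1)) / diff(range(te$X1)))

p.raw <- predict(fit.raw, newdata = te)
p.ok  <- predict(fit.sc,  newdata = te.ok)
p.bad <- predict(fit.sc,  newdata = te.bad)

same.ok  <- isTRUE(all.equal(p.raw, p.ok))   # matches the un-scaled predictions
same.bad <- isTRUE(all.equal(p.raw, p.bad))  # does not match
```

With the test set's own parameters, the four test values are stretched to cover the whole [0,1] interval, so the scaled model treats them as if they spanned the full training range and the predictions no longer agree with the un-scaled fit.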