The three options c("center", "scale", "nzv") center and scale the data and drop near-zero-variance predictors. From the vignette:
method = "center" subtracts the mean of the predictor's data (again
from the data in x) from the predictor values while method = "scale"
divides by the standard deviation.
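To make the arithmetic concrete, here is a minimal base-R sketch of what centering and scaling amount to. The vector x is made-up illustration data, and base R's scale() is used for comparison rather than caret itself.

```r
# Made-up predictor values, for illustration only
x <- c(2, 4, 6, 8, 10)

# "center": subtract the mean; "scale": divide by the standard deviation
centered <- x - mean(x)
standardized <- centered / sd(x)

# Base R's scale() performs the same two steps on a column
from_scale <- as.numeric(scale(x, center = TRUE, scale = TRUE))

all.equal(standardized, from_scale)  # TRUE
```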
And "nzv" excludes variables that have near-zero variance, meaning they are almost constant and most likely not useful for prediction.
To do min-max scaling, there is the "range" option:
The "range" transformation scales the data to be within ‘rangeBounds’.
If new samples have values larger or smaller than those in the
training set, values will be outside of this range.
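The behaviour described in that passage can be sketched in base R. This is only an illustration of min-max scaling with bounds of [0, 1], not caret's internal code; train_x, new_x and scale_01 are made-up names and values.

```r
# Made-up training values and new samples, for illustration only
train_x <- c(10, 20, 30, 40, 50)
new_x   <- c(5, 25, 60)

# The minimum and maximum come from the training data only
lo <- min(train_x)
hi <- max(train_x)

# Min-max scaling to [0, 1]
scale_01 <- function(v) (v - lo) / (hi - lo)

scale_01(train_x)  # 0.00 0.25 0.50 0.75 1.00
scale_01(new_x)    # -0.125 0.375 1.250: 5 and 60 fall outside [0, 1]
```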
We try it below:
library(mlbench)
data(BostonHousing)
library(caret)
idx = sample(nrow(BostonHousing),400)
df = BostonHousing[idx,]
df$chas = as.numeric(df$chas)
pre_mdl = preProcess(df,method="range")
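# G is the tuning grid from the question; the values below are only a placeholder
G = expand.grid(layer1 = 3, layer2 = 0, layer3 = 0)
nn <- train(medv ~ ., data = predict(pre_mdl,df),
            method = "neuralnet",tuneGrid=G,
            metric = "RMSE",trControl = trainControl(
              method = "cv",number = 5,verboseIter = TRUE))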
nn$preProcess
Created from 400 samples and 13 variables
Pre-processing:
- ignored (0)
- re-scaling to [0, 1] (13)
summary(nn$finalModel$data)
       crim               zn              indus             chas
 Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.000821   1st Qu.:0.0000   1st Qu.:0.1646   1st Qu.:0.0000
 Median :0.002454   Median :0.0000   Median :0.2969   Median :0.0000
 Mean   :0.042130   Mean   :0.1309   Mean   :0.3804   Mean   :0.0625
 3rd Qu.:0.039150   3rd Qu.:0.2000   3rd Qu.:0.6466   3rd Qu.:0.0000
 Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
      nox               rm              age               dis
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
 1st Qu.:0.1276   1st Qu.:0.4470   1st Qu.:0.4032   1st Qu.:0.08522
 Median :0.2819   Median :0.5076   Median :0.7503   Median :0.20133
 Mean   :0.3363   Mean   :0.5232   Mean   :0.6647   Mean   :0.25146
 3rd Qu.:0.4918   3rd Qu.:0.5880   3rd Qu.:0.9361   3rd Qu.:0.38622
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
      rad              tax            ptratio             b
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1304   1st Qu.:0.1770   1st Qu.:0.5106   1st Qu.:0.9475
 Median :0.1739   Median :0.2729   Median :0.6862   Median :0.9861
 Mean   :0.3676   Mean   :0.4171   Mean   :0.6243   Mean   :0.8987
 3rd Qu.:1.0000   3rd Qu.:0.9141   3rd Qu.:0.8085   3rd Qu.:0.9983
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
     lstat           .outcome
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1492   1st Qu.:0.2683
 Median :0.2705   Median :0.3644
 Mean   :0.3069   Mean   :0.3902
 3rd Qu.:0.4220   3rd Qu.:0.4450
 Max.   :1.0000   Max.   :1.0000
I'm not sure what you mean by "undo the scaling when predicting". Maybe you meant translating the predictions back to the original scale. First, scale the test set with the same preProcess object:
test = BostonHousing[-idx,]
test$chas = as.numeric(test$chas)
test_medv = test$medv
test = predict(pre_mdl,test)
The ranges (minimum in the first row, maximum in the second) are stored in the preProcess object:
pre_mdl$ranges
         crim  zn indus chas   nox    rm   age     dis rad tax ptratio      b
[1,]  0.00632   0  0.46    1 0.385 3.561   2.9  1.1691   1 187    12.6   0.32
[2,] 88.97620 100 27.74    2 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
     lstat medv
[1,]  1.73    5
[2,] 36.98   50
So we write a wrapper that maps scaled values back to the original scale:
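# value: scaled values; mdl: preProcess object; method: name of the element
# holding the bounds (here "ranges"); column: variable to convert back
convert_response = function(value,mdl,method,column){
  bounds = mdl[[method]][,column]
  # invert (x - min) / (max - min), assuming the default rangeBounds of c(0, 1)
  value*diff(bounds) + min(bounds)
}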
plot(test_medv,convert_response(predict(nn,test),pre_mdl,"ranges","medv"),
ylab="predicted")
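As a sanity check on the wrapper, the round trip can be verified without fitting a model. The sketch below uses a hypothetical stand-in list, fake_mdl, that only mimics the ranges matrix of a preProcess object (minimum in the first row, maximum in the second, using the medv bounds 5 and 50 from the output above). It assumes the data were scaled to [0, 1], and the medv values are made up.

```r
# Same wrapper as above, repeated so the snippet is self-contained
convert_response <- function(value, mdl, method, column) {
  bounds <- mdl[[method]][, column]
  value * diff(bounds) + min(bounds)
}

# Hypothetical stand-in for pre_mdl: only the ranges matrix is mimicked
fake_mdl <- list(ranges = cbind(medv = c(5, 50)))

original <- c(5, 14, 27.5, 50)          # made-up medv values
scaled   <- (original - 5) / (50 - 5)   # min-max scaling to [0, 1]

restored <- convert_response(scaled, fake_mdl, "ranges", "medv")
all.equal(restored, original)  # TRUE
```

If the model had been trained with different rangeBounds, the inverse would need to account for those bounds as well.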
