preProc = c("center", "scale") meaning in caret's package (R) and min-max normalization

Question

I am wondering how preProc can be used within the train() function of caret. I am running a neural network in the train() function using neuralnet. The code comes from this question.

This is actually the code:

nn <- train(medv ~ ., 
            data = df, 
            method = "neuralnet", 
            tuneGrid = grid,
            metric = "RMSE",
            preProc = c("center", "scale", "nzv"), #good idea to do this with neural nets - your error is due to non scaled data
            trControl = trainControl(
              method = "cv",
              number = 5,
              verboseIter = TRUE)
            )

The original data is not scaled, so that it is recommended to scale the data before running the neural network.

However, in the argument preProc appears three elements: center, scale, nzv. I am having problems interpreting those values, as I do not know why they are present. Furthermore, I would like to scale/normalize my data using min-max. This would be the function:

maxs = apply(pk_dc_only$C, 2, max)
mins = apply(pk_dc_only$C, 2, min)
scaled = as.data.frame(scale(df, center = mins, scale = maxs - mins))

Is it possible to normalize my data using min-max scaling within preProc?

And if so, how could I undo the scaling when predicting?

score 1 · Accepted Answer · answered Jun 05 '20 at 20:59

The three options c("center", "scale", "nzv") does scale and center, in the vignette:

method = "center" subtracts the mean of the predictor's data (again from the data in x) from the predictor values while method = "scale" divides by the standard deviation.

And nzv basically excludes variables that have near zero variance, meaning they are almost constant and most likely not useful for prediction. To do min max, there is an option:

The "range" transformation scales the data to be within ‘rangeBounds’. If new samples have values larger or smaller than those in the training set, values will be outside of this range.

we try it below:

library(mlbench)
data(BostonHousing)
library(caret)

idx = sample(nrow(BostonHousing),400)
df = BostonHousing[idx,]
df$chas = as.numeric(df$chas)
pre_mdl = preProcess(df,method="range")

nn <- train(medv ~ ., data = predict(pre_mdl,df),
method = "neuralnet",tuneGrid=G,
metric = "RMSE",trControl = trainControl(
method = "cv",number = 5,verboseIter = TRUE))

nn$preProcess
Created from 400 samples and 13 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (13)

summary(nn$finalModel$data)


          crim                zn             indus             chas       
 Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000821   1st Qu.:0.0000   1st Qu.:0.1646   1st Qu.:0.0000  
 Median :0.002454   Median :0.0000   Median :0.2969   Median :0.0000  
 Mean   :0.042130   Mean   :0.1309   Mean   :0.3804   Mean   :0.0625  
 3rd Qu.:0.039150   3rd Qu.:0.2000   3rd Qu.:0.6466   3rd Qu.:0.0000  
 Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      nox               rm              age              dis         
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.1276   1st Qu.:0.4470   1st Qu.:0.4032   1st Qu.:0.08522  
 Median :0.2819   Median :0.5076   Median :0.7503   Median :0.20133  
 Mean   :0.3363   Mean   :0.5232   Mean   :0.6647   Mean   :0.25146  
 3rd Qu.:0.4918   3rd Qu.:0.5880   3rd Qu.:0.9361   3rd Qu.:0.38622  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
      rad              tax            ptratio             b         
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.1304   1st Qu.:0.1770   1st Qu.:0.5106   1st Qu.:0.9475  
 Median :0.1739   Median :0.2729   Median :0.6862   Median :0.9861  
 Mean   :0.3676   Mean   :0.4171   Mean   :0.6243   Mean   :0.8987  
 3rd Qu.:1.0000   3rd Qu.:0.9141   3rd Qu.:0.8085   3rd Qu.:0.9983  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     lstat           .outcome     
 Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.1492   1st Qu.:0.2683  
 Median :0.2705   Median :0.3644  
 Mean   :0.3069   Mean   :0.3902  
 3rd Qu.:0.4220   3rd Qu.:0.4450  
 Max.   :1.0000   Max.   :1.0000

Not very sure what you mean by "undo the scaling when predicting". Maybe you meant translating them back to the original scale:

test = BostonHousing[-idx,]
test$chas = as.numeric(test$chas)
test_medv = test$medv
test = predict(pre_mdl,test)

The range is stored under the preProcess model, under

pre_mdl$ranges
         crim  zn indus chas   nox    rm   age     dis rad tax ptratio      b
[1,]  0.00632   0  0.46    1 0.385 3.561   2.9  1.1691   1 187    12.6   0.32
[2,] 88.97620 100 27.74    2 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
     lstat medv
[1,]  1.73    5
[2,] 36.98   50

So we write a wrapper:

convert_response = function(value,mdl,method,column){
bounds = mdl[[method]][,column]
value*diff(bounds) + min(bounds)
}

plot(test_medv,convert_response(predict(nn,test),pre_mdl,"ranges","medv"),
ylab="predicted")

preProc = c("center", "scale") meaning in caret's package (R) and min-max normalization

1 Answers1

Linked