I have a dataset, say Data, which consists of categorical and numerical variables. After cleaning, I scaled only the numerical variables (I assume categorical variables must not be scaled) using
Data <- Data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)
I then split it randomly into a 70/30 train/test split using
set.seed(123)
sample_size = floor(0.70*nrow(Data))
xyz <- sample(seq_len(nrow(Data)),size = sample_size)
Train_Set <- Data[xyz,]
Test_Set <- Data[-xyz,]
I have built a classification model with ranger, say model_rang, trained on Train_Set and evaluated on Test_Set.
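For context, the training call looked roughly like this (a minimal sketch; probability = TRUE is my assumption, since the prediction code below expects class probabilities):

library(ranger)
# Sketch of the training call; Class must be a factor for a probability forest
model_rang <- ranger(Class ~ ., data = Train_Set,
                     num.trees = 5000,
                     probability = TRUE)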
If new data, say new_data, arrives in production, is it enough, after cleaning, to scale it the same way? I mean
new_data <- new_data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)
and then use it to predict the outcome (there are two classes, 0 and 1, with 1 being the class of interest) via
probabilities <- as.data.frame(predict(model_rang, data = new_data, num.trees = 5000, type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilities) - 1,new_data$Class), positive='1')
Is the scaling here done properly, the same way as for Data, or am I missing something crucial with the production data?
Or must I scale Train_Set separately, record the standard deviation and mean of each variable, use those to scale Test_Set, and then, when new data arrives in production, apply the same Train_Set standard deviations and means to every new data set? Something like the sketch below:
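Roughly this, in base R (apply_scaling is just a helper name I made up for illustration):

# Means and SDs computed from the training set only
num_cols <- names(Train_Set)[sapply(Train_Set, is.numeric)]
train_means <- sapply(Train_Set[num_cols], mean, na.rm = TRUE)
train_sds <- sapply(Train_Set[num_cols], sd, na.rm = TRUE)

# Apply the *training* parameters to any data set (train, test, or production)
apply_scaling <- function(df, means, sds) {
  for (col in names(means)) {
    df[[col]] <- (df[[col]] - means[col]) / sds[col]
  }
  df
}

Train_Set <- apply_scaling(Train_Set, train_means, train_sds)
Test_Set <- apply_scaling(Test_Set, train_means, train_sds)
new_data <- apply_scaling(new_data, train_means, train_sds)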