
I have a dataset, say Data, which consists of categorical and numerical variables. After cleaning them, I have scaled only the numerical variables (I guess categorical variables must not be scaled) using

Data <- Data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

I then split it randomly into a 70/30 train/test split using

set.seed(123)
sample_size = floor(0.70*nrow(Data))
xyz <- sample(seq_len(nrow(Data)),size = sample_size)
Train_Set <- Data[xyz,]
Test_Set <- Data[-xyz,]

I have built a classification model using ranger, say model_rang, trained on Train_Set and tested on Test_Set.

If new data, say new_data, arrives in production, after cleaning it, is it enough to scale it the same way? I mean

new_data <- new_data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

and then use it to predict the outcome (there are two classes, 0 and 1, and class 1 is the one of interest):

probabilities <- as.data.frame(predict(model_rang, data = new_data, num.trees = 5000, type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilities) - 1,new_data$Class), positive='1')

Is the scaling done properly, the same way as in Data, or am I missing anything crucial for the production data?

Or must I scale Train_Set separately, take the standard deviation and mean of each of its variables to scale Test_Set, and then, when new data arrives in production, apply those same Train_Set standard deviations and means to every new data set?

Ray
  • I find this link very interesting: https://stats.stackexchange.com/questions/89172/how-to-scale-new-observations-for-making-predictions-when-the-model-was-fitted-w – Ray Jun 05 '20 at 08:33

1 Answer


When you scale data, you subtract its mean and divide by its standard deviation. The mean and standard deviation of your new data will generally not match those of the training data used to build your model.

Imagine that in your random forest one variable is split at 0.555 (on the scaled data). If the standard deviation of your new data is lower, values that would have fallen below 0.555 now land above it, get sent down a different branch, and end up classified into a different class.
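To make this concrete, here is a small sketch (the 0.555 split point and the distributions are made up for illustration) comparing a new batch rescaled with the training mean/sd against the same batch rescaled with its own statistics:

```r
set.seed(1)

# "Training" variable: the model's split point of 0.555 lives on this scale
train_x <- rnorm(1000, mean = 50, sd = 10)
train_center <- mean(train_x)
train_sd     <- sd(train_x)

# New batch with the same mean but a smaller standard deviation
new_x <- rnorm(1000, mean = 50, sd = 5)

# Correct: reuse the training center/sd
with_train_params <- (new_x - train_center) / train_sd
# Incorrect: let scale() use the new batch's own center/sd
with_own_params <- as.vector(scale(new_x))

split_point <- 0.555
# A different number of observations falls on each side of the split,
# so the two scalings send different rows down different branches
sum(with_train_params > split_point)
sum(with_own_params > split_point)
```

With the smaller spread, scaling by the batch's own statistics pushes far more observations above the split point than scaling by the training statistics does.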

One thing you can do is store the scaling attributes, as in the post you pointed to:

set.seed(111)

# Toy data with one categorical and two numeric columns
data = data.frame(A=sample(letters[1:3],100,replace=TRUE),
                  B=runif(100),C=rnorm(100))

num_cols = names(which(sapply(data,is.numeric)))

# scale() records the centers and scales it used as attributes; keep them
scale_params = attributes(scale(data[,num_cols]))[c("scaled:center","scaled:scale")]

newdata = data.frame(A=sample(letters[1:3],100,replace=TRUE),
                     B=runif(100),C=rnorm(100))

# Rescale the new batch with the stored centers and scales
newdata[,num_cols] = scale(newdata[,num_cols],
                           center=scale_params[[1]],scale=scale_params[[2]])
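Applied to the question's setup, the same idea looks roughly like this (a sketch with toy data standing in for Data; the point is that the centers and scales are computed from Train_Set only and then reused for Test_Set and, later, for new_data):

```r
set.seed(123)
# Toy stand-in for the question's Data
Data <- data.frame(A = sample(letters[1:3], 100, replace = TRUE),
                   B = runif(100), C = rnorm(100))

# 70/30 split as in the question
xyz <- sample(seq_len(nrow(Data)), size = floor(0.70 * nrow(Data)))
Train_Set <- Data[xyz, ]
Test_Set  <- Data[-xyz, ]

num_cols <- names(which(sapply(Data, is.numeric)))

# Fit the scaling on the training split only and keep its parameters
scaled_train <- scale(Train_Set[, num_cols])
centers <- attr(scaled_train, "scaled:center")
scales  <- attr(scaled_train, "scaled:scale")
Train_Set[, num_cols] <- scaled_train

# Reuse the training centers/scales for the test split
# (and for new_data when it arrives in production)
Test_Set[, num_cols] <- scale(Test_Set[, num_cols],
                              center = centers, scale = scales)
```

After this, the training columns have mean 0 and sd 1 by construction, while the test columns will be close to, but not exactly, standardized, which is expected since they were scaled with the training parameters.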
StupidWolf
  • +1 for the clear answer and example! By the way, does this rationale also apply to the `Test_Set` specified in the question, or only for production data such as `new_data`? If the first one is true, can we thus say that it is possible to perform scaling of `Data` before splitting in training and test sets? – Ric S Jun 05 '20 at 09:56
  • Thanks @RicS for the feedback. You normally scale the data before splitting into test and train. Makes more sense, and in OP's example, it's done this way. So it only applies to new_data – StupidWolf Jun 05 '20 at 11:57
  • Thank you @StupidWolf. By the way, I saw this answer on SO https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data that says the opposite to what you said in your comment. What do you think about that? – Ric S Jun 05 '20 at 12:19
  • Also these links https://stats.stackexchange.com/questions/267012/difference-between-preprocessing-train-and-test-set-before-and-after-splitting and https://datascience.stackexchange.com/questions/54908/data-normalization-before-or-after-train-test-split – Ric S Jun 05 '20 at 12:26
  • The first link you share, the accepted answer says: "Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables." – StupidWolf Jun 05 '20 at 12:36
  • I thought this is what I was proposing? Or maybe I went awry in my phrasing somehow.. – StupidWolf Jun 05 '20 at 12:36
  • In your first comment you said "you normally scale the data before splitting into test and train", but the accepted answer of the first link says the opposite "You first need to split the data into training and test set (validation set could be useful too)". – Ric S Jun 05 '20 at 12:50
  • Thank you very much for the explanation with an example. – Ray Jun 08 '20 at 09:22