I have a problem with classification prediction. My original data frame has 19670 observations of 115 variables (numeric and categorical). The class variable I am trying to model, BiClass (consisting of classes "0" and "1"), has 13540 observations in class "0" and 6130 in class "1"; the "1"s are the class of interest. I split the entire data frame in an 80-20 ratio, giving a Train set of 15736 observations and a Test set of 3934. Train has 10853 "0" and 4883 "1" observations, and Test has 2687 "0" and 1247 "1".
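As a quick sanity check on the figures quoted in this question, the share of class "1" in each set can be computed directly. This is only a sketch in plain Python (the question itself uses R); the dictionary keys and the `positive_share` helper are names made up for illustration, and the "new" counts are the ones given later in the question.

```python
# Class counts quoted in the question: (number of "0", number of "1").
# The keys are illustrative labels, not names from the original code.
counts = {
    "full": (13540, 6130),
    "train": (10853, 4883),
    "test": (2687, 1247),
    "new": (1962, 703),
}


def positive_share(n0, n1):
    """Fraction of observations that belong to class "1"."""
    return n1 / (n0 + n1)


shares = {name: positive_share(n0, n1) for name, (n0, n1) in counts.items()}

for name, share in shares.items():
    total = sum(counts[name])
    print(f"{name:5s} total={total:5d} share of '1'={share:.3f}")
```

The split keeps roughly 31% of class "1" in both Train and Test, while the new data has about 26%, so the class balance alone differs only modestly between the sets.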

I trained a random forest classification model on the Train set using ranger and predicted on the Test set. The prediction results are as follows:

[image: prediction results on the Test set]

I then received completely new data with 2665 observations to classify. When I apply the model trained on the Train set to the new data, I get the following prediction results:

[image: prediction results on the new data]

This is wrong, since the new data actually contains 1962 "0" and 703 "1" observations. However, the model places everything as a false positive.

I thought the model was biased towards "0" because of class imbalance, so I used ROSE and SMOTE to rebalance the data before training the models. I used these retrained models to get the following predictions:

ROSE:

[image: prediction results with ROSE]

SMOTE:

[image: prediction results with SMOTE]

With 115 variables available, I tried different feature selection methods and correlation techniques to choose features for the model, but the results are still unsatisfactory. 10-fold cross-validation and xgboost always fail with unrecognized levels in the new data set. I even checked that the data had been read correctly, and compared the distributions of the entire data set and of the Train data with those of the new data. Nothing works. I suspect the problem lies not in the modeling but in the data, leading to a large bias.
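One way to narrow down the "unrecognized levels" complaint is to list, per categorical column, the values that occur in the new data but never in the training data. The sketch below is plain Python rather than R and is not the code used in this question; the `unseen_levels` helper, the column names `region` and `plan`, and the toy rows are all invented for illustration.

```python
def unseen_levels(train_rows, new_rows, categorical_columns):
    """Map each column to the sorted values seen in new_rows but not in train_rows."""
    report = {}
    for col in categorical_columns:
        known = {row[col] for row in train_rows}
        extra = {row[col] for row in new_rows} - known
        if extra:
            report[col] = sorted(extra)
    return report


# Toy rows standing in for the real data frames (hypothetical columns).
train_rows = [
    {"region": "north", "plan": "basic"},
    {"region": "south", "plan": "premium"},
]
new_rows = [
    {"region": "north", "plan": "basic"},
    {"region": "west", "plan": "trial"},
]

report = unseen_levels(train_rows, new_rows, ["region", "plan"])
print(report)
```

An empty report would suggest that the level mismatch comes from how the factors were encoded rather than from genuinely new categories.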

Is there something basic that I am missing? I find it hard to produce a reproducible example along these lines that users here could understand.

Ray
  • I split the data set into train and test after scaling, and I scaled the new data without reusing the earlier scaling parameters. This was fundamentally wrong and caused the problems. I have corrected everything based on this discussion and it now works: https://stackoverflow.com/questions/62209496/scaling-production-data – Ray Jun 17 '20 at 11:25
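The fix described in this comment can be illustrated with a minimal sketch: learn the centering and scaling parameters from the training values only, then reuse those same parameters for any later data. This is plain Python rather than the R code actually used, and `fit_scaler`, `apply_scaler` and the toy numbers are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn centering and scaling parameters from the given values."""
    return {"center": mean(values), "scale": pstdev(values)}


def apply_scaler(values, params):
    """Standardize values with previously learned parameters."""
    return [(v - params["center"]) / params["scale"] for v in values]


# Toy feature values; the new data is deliberately shifted upwards.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
new = [20.0, 22.0, 24.0]

params = fit_scaler(train)  # fitted on the training values only
train_scaled = apply_scaler(train, params)

# Reusing the training parameters keeps new data on the training scale.
new_reused = apply_scaler(new, params)

# Refitting on the new data (the mistake described above) recenters it
# around zero and hides the shift from the model.
new_refit = apply_scaler(new, fit_scaler(new))

print("reused:", [round(v, 3) for v in new_reused])
print("refit: ", [round(v, 3) for v in new_refit])
```

With the reused parameters, the shifted values stay clearly above the training range. After refitting they look like ordinary centered data, so the model sees inputs on a different scale from the one it was trained on.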

1 Answer

Without the code you are using and a sample of the data you have fed to your algorithm, this is a tough question to answer.

cousin_pete