Our team uses CatBoost to develop credit scoring models. Our current process is as follows (a rough code sketch appears after the list):
- Sort the data chronologically for out-of-time sampling, and split it into train, valid, and test sets
- Perform feature engineering
- Perform feature selection and hyperparameter tuning (mainly learning rate) on train, using valid as an eval set for early stopping
- Perform hyperparameter tuning on the combination of train and valid, using test as an eval set for early stopping
- Evaluate the results of Step #4 using standard metrics (RMSE, ROC AUC, etc.)
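
To make the discussion concrete, here is a minimal sketch of the pipeline above. The file name, column names (`application_date`, `default_flag`), split fractions, and hyperparameter values are placeholders rather than our actual setup, and feature engineering, feature selection, and `cat_features` handling are omitted for brevity.

```python
import pandas as pd
from catboost import CatBoostClassifier

# Placeholder data: assume a date column for ordering and a binary default flag as target
df = pd.read_csv("loans.csv").sort_values("application_date")

# Step #1: chronological (out-of-time) split into train / valid / test
n = len(df)
train = df.iloc[: int(0.6 * n)]
valid = df.iloc[int(0.6 * n): int(0.8 * n)]
test = df.iloc[int(0.8 * n):]

features = [c for c in df.columns if c not in ("default_flag", "application_date")]
X_tr, y_tr = train[features], train["default_flag"]
X_va, y_va = valid[features], valid["default_flag"]
X_te, y_te = test[features], test["default_flag"]

# Step #3: tune on train, early-stop on valid (hyperparameter search reduced to fixed values here)
model = CatBoostClassifier(learning_rate=0.05, iterations=5000, eval_metric="AUC", verbose=False)
model.fit(X_tr, y_tr, eval_set=(X_va, y_va), early_stopping_rounds=100, use_best_model=True)

# Step #4 (as we do it today): re-tune on train + valid, early-stop on test
X_trva = pd.concat([X_tr, X_va])
y_trva = pd.concat([y_tr, y_va])
model_final = CatBoostClassifier(learning_rate=0.05, iterations=5000, eval_metric="AUC", verbose=False)
model_final.fit(X_trva, y_trva, eval_set=(X_te, y_te), early_stopping_rounds=100, use_best_model=True)

# Step #5: evaluate model_final on test (the same set used for early stopping just above)
```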
However, I am concerned that we may be overfitting to the test set in Step #4.
In Step #4, should we instead just refit the model on train + valid without any tuning, i.e., reusing the features and hyperparameters selected in Step #3 (sketched below)?
The motivation for having Step #4 at all is that, under our out-of-time sampling scheme, the valid set covers a more recent period than train, so refitting on train + valid lets the final model learn from more recent data.
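
For reference, here is roughly what the refit-only alternative would look like, continuing from the sketch above (so the variable names and placeholder hyperparameters are the same assumptions as before): keep the features and hyperparameters from Step #3, freeze the tree count chosen by early stopping on valid, and fit on train + valid without passing test as an eval set, so test is only touched at evaluation time in Step #5.

```python
# Hypothetical refit-only Step #4: reuse the Step #3 hyperparameters and the
# tree count selected by early stopping on valid, instead of early stopping on test.
n_trees = model.tree_count_  # trees kept after early stopping on valid (use_best_model=True)

model_refit = CatBoostClassifier(
    learning_rate=0.05,   # same placeholder value as in Step #3
    iterations=n_trees,   # fixed iteration count; no eval_set, no early stopping
    eval_metric="AUC",
    verbose=False,
)
model_refit.fit(X_trva, y_trva)  # test is now used only for the final evaluation in Step #5
```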