Should I always do train/test split before the feature selection process?

Asked Mar 31 '21 at 10:39

Active Mar 31 '21 at 10:48

Viewed 1,009 times

I have read the thread Should Feature Selection be done before Train-Test Split or after?, and one of the answers there explains it very well. However, is it a must? I mean, if I run feature selection on the whole dataset without splitting it first, will I always overfit?

For example, I ran Boruta on my whole dataset, and it selected 23 features. I also ran Boruta on my train set only and on my test set only, which selected 15 and 11 features, respectively.

How can I tell whether I am overfitting or not?

edited Mar 31 '21 at 10:48

Peter O.


asked Mar 31 '21 at 10:39

hotan


You should ask this question on stats.stackexchange.com, rather than here. – Peter O. Mar 31 '21 at 10:48
Agree with the comment above, this should be on stats. In short, the explanation in your link is already very good and points out the problem with using the test set in model creation: you're using what should be unknown data to build the model. Don't necessarily think in terms of overfitting; the point is that your model evaluation (using the test set) can no longer be trusted, since the model is not being evaluated on unseen data once the test data has been used to create it. – dm2 Mar 31 '21 at 10:56
I have posted the same question there. However, that site seems dead. – hotan Mar 31 '21 at 16:38


0 Answers