I have been trying to solve the DAT102x: Predicting Mortgage Approvals From Government Data competition for a couple of months.
My goal is to understand the pieces of a classification problem, not to rank at the top.
However, I found something that is not clear to me:
I get almost the same performance from a cross-validated model built on a sample (accuracy = 0.69) as I get when I score the whole dataset with that model (accuracy = 0.69).
BUT when I submit predictions for the competition dataset I get a "beautiful" 0.5.
- It sounds like an overfitting problem,
but I would assume that an overfitting problem would be spotted by the CV...
The only logical explanation I have is that the CV fails because it is based on a sample that I created using the "train_test_split" function.
In other words: because I used this way of sampling, my sample has become a kind of FRACTAL: whatever sub-sample I create, it is always a very precise reproduction of the population.
So: the CV "fails" to detect overfitting.
Ok. I hope I have been able to explain what is going on.
(btw, if you wonder why I do not check this by running the full population: I am using an HP Core Duo at 2.8 GHz with 8 GB of RAM... it takes forever...)
Here are the steps of my code (a rough sketch in code follows the list):
0) prepare the dataset (handle NaNs, etc.) and transform everything into categorical features (numerical --> binning)
1) use train_test_split to sample 12,000 records out of the 500K dataset
2) encode the (selected) categorical features with OHE
3) reduce the features via PCA
4) perform CV to identify the best log_reg hyperparameter "C" value
5) split the sample with train_test_split, holding 2,000 records out of the 12,000
6) build a log_reg model on X_train, y_train (accuracy: 0.69)
7) score the whole dataset with the log_reg model (accuracy: 0.69)
8) score the whole competition dataset with the log_reg model
9) get a great 0.5 accuracy result...
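For concreteness, here is a minimal sketch of those steps. The file names, the target column name ("accepted"), the PCA size and the C grid are placeholders, not the real competition fields or my exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# 0) load the prepared data (NaNs handled, everything already binned/categorical)
df = pd.read_csv("train_prepared.csv")           # placeholder file name
X_full = df.drop(columns="accepted")             # "accepted" = placeholder target column
y_full = df["accepted"]

# 1) sample 12,000 records out of the ~500K dataset
X_sample, _, y_sample, _ = train_test_split(
    X_full, y_full, train_size=12_000, stratify=y_full, random_state=0)

# 5) hold 2,000 of the 12,000 records out for a final check
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=2_000, stratify=y_sample, random_state=0)

# 2) OHE  3) PCA  4) CV over C  6) fit the logistic regression
pipe = Pipeline([
    # dense output so PCA can consume it (the parameter is `sparse=False` on older sklearn)
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ("pca", PCA(n_components=50)),               # 50 is a placeholder
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# 6) accuracy on the 2,000 held-out records, 7) accuracy on the whole dataset
print("hold-out accuracy:    ", grid.score(X_test, y_test))
print("full-dataset accuracy:", grid.score(X_full, y_full))

# 8) predictions for the competition dataset (no labels, the accuracy comes from the submission)
X_comp = pd.read_csv("test_prepared.csv")        # placeholder file name
submission = grid.best_estimator_.predict(X_comp)
```

(handle_unknown="ignore" matters here, because the full dataset and the competition dataset almost certainly contain category levels that the 12,000-record sample never saw.)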
The only other explanation I have is that I selected a bunch of features whose effect is somehow "overridden" in the competition dataset by the ones I left out (you know: the competition guys are there to make us sweat...).
Here too I have a "hardware problem" when it comes to crunching the numbers.
If anyone has any clue about this I am happy to learn.
thx a lot
- I also tried a Random Forest, but I get the same problem. In this case I understand that OHE is not something the sklearn RF model loves: it hurts the model and drowns out valuable features with "many" categories (a small illustration of what I mean is below).
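To illustrate, this is the kind of alternative I tried for the trees: feeding them ordinal codes instead of OHE, so a feature with many categories stays a single column (this reuses X_train/X_test from the sketch above; the encoder choice is just for illustration, not a recommendation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Ordinal codes keep each categorical feature as a single column, so a feature
# with many categories is not exploded into hundreds of binary columns.
rf_pipe = Pipeline([
    ("enc", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ("rf", RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)),
])
rf_pipe.fit(X_train, y_train)
print("RF hold-out accuracy:", rf_pipe.score(X_test, y_test))
```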
happy to share if requested.
I would expect one of these two outcomes:
either: a model that performs poorly on the whole dataset,
or: a model that has a comparable (0.66?) performance on the competition dataset.