I have been trying to solve the DAT102x: Predicting Mortgage Approvals From Government Data competition for a couple of months.
My goal is to understand the pieces of a classification problem, not to rank at the top.
However, I found something that is not clear to me:
I get almost the same performance from a cross-validated model built on a sample (accuracy = 0.69) as I get when I score the whole dataset with that model (accuracy = 0.69).
BUT when I submit predictions for the competition dataset I get a "beautiful" 0.5.
- It sounds like an overfitting problem,
but I would assume that an overfitting problem would be spotted by the CV...
The only logical explanation I have is that the CV fails because it is based on a sample that I created using the "train_test_split" function.
In other words: because I used this way of sampling, my sample has become a kind of FRACTAL: whatever sub-sample I create, it is always a very precise reproduction of the population.
So: the CV "fails" to detect overfitting.
Ok. I hope I have been able to explain what is going on.
(btw, if you wonder why I do not check this by running the full population: I am using an HP Core Duo at 2.8 GHz with 8 GB of RAM... it takes forever...)
Here are the steps of my code (a rough sketch in code follows the list):
0) prepare the dataset (handle NaNs, etc.) and transform everything into categorical features (numerical --> binning)
1) use train_test_split to sample 12,000 records out of the 500K dataset
2) encode the (selected) categorical features with OHE
3) reduce the features via PCA
4) perform CV to identify the best log_reg hyperparameter "C" value
5) split the sample with train_test_split, holding 2,000 records out of the 12,000
6) build a log_reg model on X_train, y_train (accuracy: 0.69)
7) score the whole dataset with the log_reg model (accuracy: 0.69)
8) score the whole competition dataset with the log_reg model
9) get a great 0.5 accuracy result...
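For concreteness, here is a minimal sketch of those steps. The file names, the target column name ("accepted"), the PCA size and the C grid are placeholders, not the real competition fields or my exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# 0) load the prepared data (NaNs handled, everything already binned/categorical)
df = pd.read_csv("train_prepared.csv")           # placeholder file name
X_full = df.drop(columns="accepted")             # "accepted" = placeholder target column
y_full = df["accepted"]

# 1) sample 12,000 records out of the ~500K dataset
X_sample, _, y_sample, _ = train_test_split(
    X_full, y_full, train_size=12_000, stratify=y_full, random_state=0)

# 5) hold 2,000 of the 12,000 records out for a final check
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=2_000, stratify=y_sample, random_state=0)

# 2) OHE  3) PCA  4) CV over C  6) fit the logistic regression
pipe = Pipeline([
    # dense output so PCA can consume it (the parameter is `sparse=False` on older sklearn)
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ("pca", PCA(n_components=50)),               # 50 is a placeholder
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# 6) accuracy on the 2,000 held-out records, 7) accuracy on the whole dataset
print("hold-out accuracy:    ", grid.score(X_test, y_test))
print("full-dataset accuracy:", grid.score(X_full, y_full))

# 8) predictions for the competition dataset (no labels, the accuracy comes from the submission)
X_comp = pd.read_csv("test_prepared.csv")        # placeholder file name
submission = grid.best_estimator_.predict(X_comp)
```

(handle_unknown="ignore" matters here, because the full dataset and the competition dataset almost certainly contain category levels that the 12,000-record sample never saw.)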
The only other explanation I have is that I selected a bunch of features whose effect is somehow "overridden" in the competition dataset by the ones I left out (you know: the competition guys are there to make us sweat...).
Here too I have a "hardware problem" when it comes to crunching the numbers.
If anyone has any clue about this I am happy to learn.
thx a lot
- I also tried a Random Forest, but I get the same problem. In this case I understand that OHE is not something the sklearn RF model loves: it hurts the model and drowns out valuable features with "many" categories (a small illustration of what I mean is below).
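To illustrate, this is the kind of alternative I tried for the trees: feeding them ordinal codes instead of OHE, so a feature with many categories stays a single column (this reuses X_train/X_test from the sketch above; the encoder choice is just for illustration, not a recommendation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Ordinal codes keep each categorical feature as a single column, so a feature
# with many categories is not exploded into hundreds of binary columns.
rf_pipe = Pipeline([
    ("enc", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ("rf", RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)),
])
rf_pipe.fit(X_train, y_train)
print("RF hold-out accuracy:", rf_pipe.score(X_test, y_test))
```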
happy to share if requested.
I would expect one of these two outcomes:
either: a model that performs poorly on the whole dataset,
or: a model that has a comparable (0.66?) performance on the competition dataset.