
I am performing URL classification (phishing vs. non-phishing) and I plotted the learning curves (training vs. cross-validation score) for my model (Gradient Boosting).

My View

It seems that these two curves converge and the difference is not significant (it's normal for the training set to have a slightly higher accuracy). (Figure 1)

[Figure 1: Gradient Boosting learning curves]

The Question

I have limited experience in machine learning, so I am asking for your opinion. Am I approaching the problem the right way? Is this model fine, or is it overfitting?

Note: The classes are balanced and the features are well chosen.

Relevant code

import numpy as np
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

def plot_learning_curves(X, y, model):
    # Create the learning curve visualizer with stratified 5-fold CV
    cv = StratifiedKFold(n_splits=5)
    # Evaluate the model at 8 training-set sizes, from 10% to 100% of the data
    sizes = np.linspace(0.1, 1.0, 8)
    visualizer = LearningCurve(model, cv=cv, train_sizes=sizes, n_jobs=4)
    visualizer.fit(X, y)  # Fit the data to the visualizer
    visualizer.poof()     # Draw the plot (renamed to show() in newer Yellowbrick)
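
For completeness, a hypothetical call to the function above; the synthetic dataset and the GradientBoostingClassifier settings are illustrative stand-ins for the real URL features, not part of the original setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the URL feature matrix and phishing labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
plot_learning_curves(X, y, GradientBoostingClassifier())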
Dimi
  • This is the topic of the bias versus variance tradeoff; look into it for more information. Also, it is not overfitting: as you can see, by increasing the number of examples you increase bias, not variance (overfitting). – Kartikey Singh Oct 03 '19 at 14:42
    Is "training instances" the number of points you use for training ? – Florian Mutel Oct 03 '19 at 14:46
  • @FlorianMutel yes. – Dimi Oct 03 '19 at 14:48
  • In response to your last comment in your now [deleted question](https://stackoverflow.com/questions/61752620/suspiciously-low-false-positive-rate-with-gaussian-naive-bayes-classifier), try [Data Science SE](https://datascience.stackexchange.com/help/on-topic), but I would kindly suggest making your question more focused. – desertnaut May 12 '20 at 13:31
  • @desertnaut Thank you very much for your suggestions. Much appreciated. – Dimi May 12 '20 at 14:10

1 Answer


Firstly, your graph actually shows 8 different models: one trained at each of the 8 training-set sizes.

It's hard to tell whether any of them is overfitting, because overfitting is detected with an "epochs vs. performance (train/valid)" graph (there would be 8 such graphs in your case).

Overfitting means that, past a certain point, as the number of epochs increases, training accuracy keeps going up while validation accuracy goes down. This can happen, for example, when you have too few data points for the complexity of your problem, so your model starts exploiting spurious correlations.
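
To illustrate, here is a minimal sketch of such an "epochs vs. performance" plot for gradient boosting, where each boosting stage plays the role of an epoch; it uses sklearn's staged_predict, and the synthetic data and hyperparameters are placeholder assumptions, not taken from the question:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own features and labels
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=42)

model = GradientBoostingClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# staged_predict yields the predictions after each boosting stage ("epoch")
train_acc = [np.mean(p == y_train) for p in model.staged_predict(X_train)]
valid_acc = [np.mean(p == y_valid) for p in model.staged_predict(X_valid)]

plt.plot(train_acc, label="train")
plt.plot(valid_acc, label="validation")
plt.xlabel("boosting stage")
plt.ylabel("accuracy")
plt.legend()
plt.show()

A growing gap, with training accuracy still rising while validation accuracy turns down, is the overfitting signature described above.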

What your graph does tell us is that the complexity of your problem seems to require a "high" number of training instances, because your validation performance keeps increasing as you add more of them. There is a chance that the models trained with <10000 instances are overfitting, but your >50000 models could be overfitting too and we wouldn't see it, because you are using early stopping!
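
For reference, if the model is sklearn's GradientBoostingClassifier, early stopping is controlled by parameters like these; this is a sketch of the mechanism, not the asker's actual configuration:

from sklearn.ensemble import GradientBoostingClassifier

# With n_iter_no_change set, training stops once the score on the held-out
# validation_fraction stops improving for that many consecutive stages
model = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=10,
)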

Hope this helps.

Florian Mutel
  • Could you please explain further why there are 8 models in my learning curve? I only see one, and the graph displays its accuracy as the datapoints increase. – Dimi Oct 03 '19 at 16:22
  • I added the code to the main question; that may help. – Dimi Oct 03 '19 at 16:24
  • Each point on your graph is one model trained with N data points, using the parameters defined by your "model" object. – Florian Mutel Oct 03 '19 at 16:35
  • Thank you for your valuable help. Do you know how I can plot an "epoch vs performance (train/valid)" graph in Python with sklearn? – Dimi Oct 03 '19 at 16:56
  • Couldn't find anything quickly in the sklearn API, but it is stored in your estimator's properties. You can take a look at this question: https://stackoverflow.com/questions/46912557/is-it-possible-to-get-test-scores-for-each-iteration-of-mlpclassifier. Even if it is a different estimator, the API should be similar. – Florian Mutel Oct 03 '19 at 17:23