
I am performing preliminary tests using sklearn in my code.

I am testing:

1) sklearn.cross_validation.cross_val_score

2) sklearn.cross_validation.train_test_split

like in this question.

The code is the following:

#X is my data and Y the corresponding binary labels
from sklearn import svm, cross_validation
from sklearn.metrics import roc_auc_score

#My classifier (kernel, gamma, degree, cache_size and probability are
#defined elsewhere)
clf = svm.SVC(class_weight='auto', kernel=kernel, gamma=gamma,
    degree=degree, cache_size=cache_size, probability=probability)


#1st method: ShuffleSplit and cross validation
cv = cross_validation.ShuffleSplit(X.shape[0], n_iter=5,
    test_size=0.4, random_state=0)
#Scoring
scores = cross_validation.cross_val_score(clf, X, Y,
    cv=cv, n_jobs=3, scoring="roc_auc")

#2nd method: train_test_split
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, Y, test_size=0.4, random_state=42)

clf.fit(X_train, y_train)
pred_test = clf.predict(X_test)
#Scoring
score = roc_auc_score(y_test, pred_test)

The difference from the other question is that my data is randomized in both cases 1) and 2).

However, for case 1) I get the following scores:

[ 0.9453893   0.94878745  0.95197478  0.95150763  0.94971746]

and for case 2):

0.867637

I don't quite understand the reason for these different scores and can't see what I'm missing here.

Shouldn't the scores be similar?

Thank you for your time.


1 Answer


I know that I'm late to this, but I have just been having a similar problem and happened to stumble upon this post. I was having exactly the same issue when comparing results from train_test_split and cross_val_score, using the roc_auc_score metric.

I think that the problem arises from putting the predicted binary outputs from the classifier into the roc_auc_score comparison. This means that the metric only has two arrays of binary outputs to make the score from. If you try using 'predict_proba' instead, this will give you an array with two columns (presuming that you have a two-class problem here) of the probabilities of the classes for the different sample points.

On my dataset, I took the second column of this and entered it into roc_auc_score along with the true values, and this returned answers that were far more in line with the output of cross_val_score.
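To make this concrete, here is a minimal sketch of that step, reusing the classifier settings and the train/test split from the question (note that the SVC has to be built with probability=True for predict_proba to be available):

from sklearn import svm
from sklearn.metrics import roc_auc_score

#probability=True is needed for predict_proba on an SVC
clf = svm.SVC(class_weight='auto', kernel=kernel, gamma=gamma,
    degree=degree, cache_size=cache_size, probability=True)
clf.fit(X_train, y_train)

#column 1 holds the predicted probability of the positive class
proba_test = clf.predict_proba(X_test)[:, 1]

#score on probabilities rather than hard 0/1 predictions
score = roc_auc_score(y_test, proba_test)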

UPDATE:

Having learnt some more (and read the docs!), this isn't the best way to go about it: it requires setting probability=True for the SVC, which is far more computationally expensive. Instead of using either predict or predict_proba, use decision_function, and then enter these values into roc_auc_score as the predicted values.
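A sketch of the same scoring step with decision_function (again reusing the fitted classifier and the test split from the question; no probability=True is needed):

from sklearn.metrics import roc_auc_score

clf.fit(X_train, y_train)
#signed distance from the separating hyperplane, one value per test sample
dec_test = clf.decision_function(X_test)

#roc_auc_score accepts these raw confidence scores directly
score = roc_auc_score(y_test, dec_test)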

UPDATE:

In response to a comment made on this answer, I've also attached a couple of figures to explain the process. I'll also provide some background information that aided me when learning about this.

The Receiver Operating Characteristic curve is made by seeing how the relative amounts of true vs false positives change as the threshold for a decision boundary moves from strict to more relaxed. This explanation, however, can seem somewhat inscrutable, so a figure is provided here. This shows the decision boundary for a linear Support Vector Machine on some generated data with 2 features, the 'blue' class and the 'red' class. The solid line represents the threshold for binary decisions that is found by training the SVM. All of the points represent data that was used to train the model. Any new data can be added to the plot; if they appear in the bottom left, they will be labelled 'red', and in the top right 'blue'. We can think of 'red' as the 'positive' class, and therefore the output from prediction is a binary {0, 1} output (red = 1, blue = 0).

[Figure: Linear SVM output for binary prediction]

One thing to notice is that the data points are not perfectly linearly separable: there is a region near the decision boundary where the red and blue points overlap a lot. Therefore, a linear model can never achieve perfect performance here.

The dotted lines represent the margins of the SVM. The training of the SVM aims to maximise the width of this margin, and is very dependent on the value of the hyper-parameter C that is provided. In effect, higher values of C will force the model to fit the training data more closely, whereas lower values will allow for misclassifications here, with the intent of better generalisability to new and unseen data. A full description can be seen in the scikit-learn docs: http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html#sphx-glr-auto-examples-svm-plot-svm-margin-py. Note the points that are either misclassified or appear in this margin region; for all of the other points, we are very confident about the classification being correct.
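As a small illustration of this effect of C (on generated data, so the exact numbers will not match the figure above), the margin width of a linear SVM is 2/||w||, and it generally shrinks as C grows:

import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

#two overlapping 2-feature classes, roughly like the plot described above
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.0,
                            random_state=0)

for C in (0.01, 1.0, 100.0):
    lin = svm.SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    #for a linear SVM the margin width is 2 / ||w||
    width = 2.0 / np.linalg.norm(lin.coef_)
    print("C=%g margin width=%.3f support vectors=%d"
          % (C, width, lin.n_support_.sum()))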

So to the main point: how the AUC is calculated. I have added two extra lines on this graph, the red and blue boundary lines. These can be thought of as a shift of the main decision line from a highly selective position, where only the most confident red points are actually classified as red, to a very relaxed boundary, where every point will be classed as red. Remember, any point to the bottom left of this moving threshold will be classed as red.

Initially, no data points meet the criteria to be classified as red, but as the line moves in the direction of the arrows, it starts to scoop up these points. In the early stages, all of these are correct, as the data points collected are all red, but as we head towards the margin area, we soon start to pick up false positives (blue points) while getting more of the reds. This pattern of collecting true and false positives at different rates is what shapes the ROC curve. The best way to show this is with another figure:

[Figure: ROC curve]

Imagine that we start to draw the curve from the bottom left, and make a small movement any time we change the threshold position. As we collect the true, red, positives, we draw our line in the y-axis direction, but as we collect blues, we draw in the x-axis direction. The aim is to send the line as close to the top left corner as possible, as in the end we will take the Area-Under-the-Curve (AUC) as our metric. Note that at the end, the line always gets to the top right (as eventually, all the data points will be classed as red), and in this case it is just travelling along the top of the graph. This is because, in this dataset, as the threshold moves closer to the blue line, we are only getting false positives.

Now imagine 2 very different situations: if the data were perfectly linearly separable, so that none of the training data points were on the 'wrong' side of the boundary, the ROC line would head directly up the y-axis until it reached the top left, and then head along the top of the graph to the top right, giving an AUC of 1. However, if the data points were just a cloud of noise, all mixed in the centre, you would get false positives at the same rate as true positives, and your line would follow the diagonal and give an AUC of 0.5, which is why this value represents chance-level performance.
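This threshold sweep can be reproduced directly with scikit-learn's roc_curve function (reusing the fitted clf and the test split from above):

from sklearn.metrics import roc_curve, auc

#each threshold position gives one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_test, clf.decision_function(X_test))

#the area under that curve is the value roc_auc_score returns
roc_auc = auc(fpr, tpr)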

I am not a contributor to scikit-learn, and I haven't examined the source code here, but I can imagine that roc_auc_score uses the values from decision_function or predict_proba as a representation of how confident the model is that a point belongs to the positive (in our case red) class. Therefore the same logic of relaxing the boundary and looking at the changing rates of true to false positives still holds. If this is not right, then please correct me.
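In any case, one way to convince yourself that only the confidence ranking matters is that roc_auc_score is unchanged by any strictly increasing transform of the scores. A toy example (the numbers here are made up, not from my dataset):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([-2.1, -0.3, 0.4, 1.7, 0.1, -0.2])

#all three calls print the same AUC, because the sample ranking is identical
print(roc_auc_score(y_true, scores))
print(roc_auc_score(y_true, 10 * scores + 5))
print(roc_auc_score(y_true, 1 / (1 + np.exp(-scores))))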

  • Hi, I am also having a similar problem here. Can you please explain why the result becomes similar to that of cross_val_score if we take the second column value of predict_proba? In addition, some classifiers do not have the decision_function (e.g. decision tree). – Charlotte Aug 23 '18 at 06:54
  • Hi Charlotte, sorry for the late reply on this, but I have just tried to give a better explanation of the process in a recent edit. The second column of predict_proba represents the probability of the data point being in the positive class and represents how 'confident' the model is, which allows for the threshold based AUC to be calculated correctly. This puts the output similar to cross_val_score. Let me know if you have any other questions. – Tim Vivian-Griffiths Sep 07 '18 at 08:48
  • @TimVivian-Griffiths Indeed, you can see [in the code](https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/metrics/scorer.py#L182) that the `decision_function` is used. Many thanks for taking the time to answer – Lili_marston Jan 03 '19 at 08:52