
I have a matrix with 20 columns. The last column contains 0/1 labels.

The link to the data is here.

I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this:

  1. using sklearn.cross_validation.cross_val_score
  2. using sklearn.cross_validation.train_test_split

I am getting different results even though I am doing what I think is essentially the same thing. To illustrate, I run two-fold cross-validation using the two methods above, as in the code below.

import csv
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# read in the data
data = pd.read_csv('data_so.csv', header=None)
X = data.iloc[:, 0:19]   # columns 0-18 are the features
y = data.iloc[:, 19]     # the last column (19) holds the 0/1 labels

depth = 5
maxFeat = 3

# Method 1: two-fold cross-validation with cross_val_score
result = cross_val_score(
    ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                    max_features=maxFeat, oob_score=False),
    X, y, scoring='roc_auc', cv=2)

print(result)
# result is now something like array([ 0.66773295,  0.58824739])

# Method 2: a manual 50/50 split with train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)

RFModel = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                          max_features=maxFeat, oob_score=False)
RFModel.fit(xtrain, ytrain)
prediction = RFModel.predict_proba(xtest)
auc = roc_auc_score(ytest, prediction[:, 1])   # probability of the positive class
print(auc)    # something like 0.83

# swap the two halves: train on the test half, evaluate on the train half
RFModel.fit(xtest, ytest)
prediction = RFModel.predict_proba(xtrain)
auc = roc_auc_score(ytrain, prediction[:, 1])
print(auc)    # also something like 0.83

My question is:

Why am I getting different results, i.e., why is the AUC (the metric I am using) higher when I use train_test_split?

Note: When I use more folds (say 10 folds), there appears to be some kind of pattern in my results, with the first calculation always giving me the highest AUC.

In the case of the two-fold cross validation in the example above, the first AUC is always higher than the second one; it's always something like 0.70 and 0.58.

Thanks for your help!

evianpring
  • is your data randomized initially? If I remember right, one or both of the two methods default to splitting the data with no randomization. That might explain the "pattern" you refer to, though it probably wouldn't explain the poorer overall results with the first method (it might though) – KCzar May 21 '15 at 04:27
  • No, the data is not randomized initially. That does seem to be a good explanation for why the results exhibit the same pattern in cross_val_score. I guess the only random part of cross_val_score in my case is the fact that the RandomForestClassifier has some randomness in its algorithm for choosing features in its trees. Other than that, if it's just cutting up the data into n folds based on the initial ordering, then maybe that's the problem. I'll check it out in a few hours when I actually wake up, it's the middle of the night here! – evianpring May 21 '15 at 07:16
  • so, yeah, this worked: p = np.random.permutation(len(y)) Result = cross_val_score(ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5, max_features=3, oob_score=False), X[p], y[p], scoring='roc_auc', cv=2) – evianpring May 21 '15 at 08:04

2 Answers


When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with.

The KFold iterator has shuffle and random_state parameters:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

So does train_test_split, which shuffles the data by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
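
As an illustration, here is a minimal sketch of passing a shuffled KFold to cross_val_score, written against the same (now deprecated) sklearn.cross_validation module the question uses; in current scikit-learn the equivalent classes live in sklearn.model_selection and KFold takes n_splits rather than the number of samples. X and y are assumed to be the pandas objects defined in the question.

from sklearn import ensemble
from sklearn.cross_validation import KFold, cross_val_score

# Two-fold iterator that shuffles the rows before splitting them into folds;
# fixing random_state makes the folds reproducible.
kf = KFold(len(y), n_folds=2, shuffle=True, random_state=0)

clf = ensemble.RandomForestClassifier(n_estimators=1000, max_depth=5,
                                      max_features=3, oob_score=False)

# Pass the iterator as cv instead of the plain integer 2, so the folds no
# longer follow the original row order of the file.  .values converts the
# pandas objects to plain numpy arrays.
scores = cross_val_score(clf, X.values, y.values, scoring='roc_auc', cv=kf)
print(scores)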

Patterns like the one you described are usually the result of a lack of randomness in the train/test splits.

KCzar
  • I have a question on **train_test_split**. In the above code `xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.50)`, how does the algorithm know what value must go into `xtrain`, `xtest` etc? How does it know that `xtrain` must contain the result of _training the independent variables_ of the dataset, and similarly on `xtest`. I don't suppose it is the fact that the variable contains 'train', or 'test' in it. Is it? Thanks for your help. – Anonymous Person Apr 06 '17 at 13:59

The answer is what @KCzar pointed out. I just want to note that the easiest way I found to randomize the data (shuffling X and y with the same index permutation) is the following:

p = np.random.permutation(len(X))   # one random permutation of the row indices
X, y = X[p], y[p]                   # reorder X and y with the same permutation

source: Better way to shuffle two numpy arrays in unison
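
Since X and y in the question's code are a pandas DataFrame and Series rather than plain numpy arrays, the same idea there goes through positional indexing with .iloc; a small sketch, assuming the variables from the question (the _shuffled names are just for illustration):

import numpy as np

# One random permutation of the row positions, applied to both objects so
# that features and labels stay aligned.
p = np.random.permutation(len(X))
X_shuffled = X.iloc[p].reset_index(drop=True)
y_shuffled = y.iloc[p].reset_index(drop=True)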

Sajad.sni