
I'm using scikit-learn with the code shown below.

I have class imbalance (roughly a 90:10 split of class 0 to class 1). After reading a number of other questions, I've used the class_weight parameter.

However, every time I run the code I get a different set of important features and different AUC, precision, recall, etc.

The problem goes away when I remove the class_weight parameter.

As shown, I've set random_state to a constant, so this is not the issue. A good number of the predictors are highly correlated. Does anyone know what the issue is? (Note: I posted a similar question yesterday, but it was downvoted because I hadn't been clear enough. Rather than have a long chain of comments, I deleted it; this version is hopefully clearer and provides the information needed.)

x_train, x_test, y_train, y_test = train_test_split(x, y)

parameters = {
    'max_depth': [6, 7, 8],
    'min_samples_split': [100, 150],
    'min_samples_leaf': [50, 75]
    }

clf = GridSearchCV(DecisionTreeClassifier(
    random_state=99,
    class_weight='balanced'), 
    parameters, refit=True, cv=10) 

clf.fit(x_train, y_train.ravel())

# create main tree using best settings
clf2 = DecisionTreeClassifier(
    max_depth=clf.best_params_['max_depth'],
    min_samples_split=clf.best_params_['min_samples_split'],
    min_samples_leaf=clf.best_params_['min_samples_leaf'],
    random_state=99,
    class_weight='balanced')

clf2.fit(x_train, y_train.ravel()) 
A Rob4
  • As I also said in the previous (now deleted) question: setting a random_state in only one place will not make the code reproducible. You need to check the splitting. – Vivek Kumar Nov 23 '17 at 11:19
  • `train_test_split` also uses random shuffling to shuffle the data and then split. So that's one place of randomness. Then GridSearchCV will also have randomness due to the `cv` param. – Vivek Kumar Nov 23 '17 at 11:22
  • Ah, I see now what you mean! Yes, you are correct - thank you. As I said, I'm new to all this and didn't know that could be the cause of the problem. Marked as answered. – A Rob4 Nov 23 '17 at 11:37

1 Answer


In the above code, there are multiple points of randomness.

1) train_test_split uses random shuffling to shuffle the data before splitting it into train and test sets. So first you need to stabilize that.

2) GridSearchCV uses a cv parameter, which for classification tasks uses a StratifiedKFold() to split the data into different folds. So that's another source of randomness.
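A minimal sketch of pinning both sources explicitly: pass random_state to train_test_split, and pass a seeded StratifiedKFold as cv instead of the bare integer 10. (The synthetic data via make_classification and the seed values are illustrative assumptions, not from the original code.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# illustrative imbalanced data standing in for the asker's x, y
x, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# 1) fix the train/test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify=y, random_state=42)

# 2) fix the fold assignment by passing a seeded splitter instead of cv=10
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = GridSearchCV(
    DecisionTreeClassifier(random_state=99, class_weight='balanced'),
    {'max_depth': [6, 7, 8]}, refit=True, cv=cv)
clf.fit(x_train, y_train)

With every random_state pinned, repeated runs select the same hyperparameters and produce the same feature importances.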

Workaround: Add this line to your code before any data processing (best at the top, just below the import lines).

numpy.random.seed(SOME_INTEGER)

Use numpy or np, depending on how you have imported it.
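A minimal sketch of what seeding the global NumPy RNG does: re-seeding with the same integer makes subsequent random draws repeat exactly. (SOME_INTEGER is shown here as 42, an arbitrary choice.)

import numpy as np

np.random.seed(42)       # seed once, before any randomized processing
a = np.random.rand(3)

np.random.seed(42)       # re-seeding with the same integer...
b = np.random.rand(3)    # ...reproduces the same draws: a == b

Note this only controls NumPy's global RNG; scikit-learn estimators and splitters that take their own random_state are best pinned explicitly as well.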

Explanation: Please see the questions linked below:

Vivek Kumar