
I want to select top K features using SelectKBest and run GaussianNB:

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

selection = SelectKBest(mutual_info_classif, k=300)

data_transformed = selection.fit_transform(data, labels)
new_data_transformed = selection.transform(new_data)

classifier = GaussianNB()
classifier.fit(data_transformed, labels)
y_predicted = classifier.predict(new_data_transformed)
acc = accuracy_score(new_data_labels, y_predicted)

However, I do not get consistent results for accuracy on the same data. The accuracy has been:

0.61063743402354853
0.60678034916768164 
0.61733658140479086 
0.61652456354039786 
0.64778725131952908 
0.58384084449857898

For the SAME data. I don't do any splits; I just use two static sets, data and new_data.

Why do the results vary? How do I make sure I get the same accuracy for the same data?

Uylenburgh

1 Answer


This is because there is randomness inside the estimator, not in your data. Some estimators and scoring functions use a random number generator internally; in your case it is mutual_info_classif, which you pass into SelectKBest.

Have a look at how the random_state parameter is used across scikit-learn estimators.

As a workaround, you can insert the following lines at the top of your script:

import numpy as np
np.random.seed(some_integer)

This sets NumPy's global seed to some_integer, and as far as I know scikit-learn estimators use NumPy's random number generator when no random_state is given. See the scikit-learn documentation on controlling randomness for more details.
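A cleaner alternative to seeding the global NumPy RNG is to pin random_state on mutual_info_classif itself, e.g. with functools.partial, and pass the resulting callable to SelectKBest. A minimal sketch on synthetic data (make_classification, the shapes, and k=10 here are stand-ins for your data / new_data and k=300):

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Toy stand-in for the question's static data / new_data sets.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Fix the scoring function's RNG instead of seeding NumPy globally.
score_func = partial(mutual_info_classif, random_state=0)

def run():
    selection = SelectKBest(score_func, k=10)
    X_train_sel = selection.fit_transform(X_train, y_train)
    X_test_sel = selection.transform(X_test)
    clf = GaussianNB().fit(X_train_sel, y_train)
    return accuracy_score(y_test, clf.predict(X_test_sel))

# Repeated runs now select the same features and give the same accuracy.
print(run(), run())
```

This keeps the rest of your program's randomness untouched, which the global np.random.seed call does not.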

Vivek Kumar