
So, I've been using KNN on a set of data, with random_state = 42 during the train_test_split phase. Despite setting the random state, the outputs (accuracy, classification report, predictions, etc.) are different each time. I was wondering why that is?

Here's the head of the data: (predicting the position based on all_time_runs and order)

order position  all_time_runs
0     10   NO BAT           1304
1      2  CAN BAT           7396
2      3   NO BAT           6938
3      6  CAN BAT           4903
4      6  CAN BAT           3761

And here's the code for the classification and prediction:

#splitting data into features and target

X = posdf.drop('position',axis=1)
y = posdf['position']   


knn = KNeighborsClassifier(n_neighbors = 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

#fitting the KNN model
knn.fit(X_train, y_train)

#predicting with the model
prediction = knn.predict(X_test)

#knn score
score = knn.score(X_test, y_test)
CurtLH
  • Another argument of `train_test_split()` function by default is `shuffle=True`. It is for whether or not to shuffle the data before splitting. It is always recommended in many training functions to shuffle your data. However, if you want to come up with same result, you need to give `shuffle=False` in that function. – aminrd Oct 04 '19 at 18:56
  • Then what is the purpose of random_state in this case if it's not producing the same result every time? What impact would it have on my output if I were to set random_state = None without changing the shuffle? – TheFutureNav Oct 04 '19 at 19:00
  • Also, it's still giving me a different result even if I set shuffle = False. So lost... – TheFutureNav Oct 04 '19 at 19:01

1 Answer


Although train_test_split has a random factor associated with it, and that factor has to be fixed to avoid random results, it is not the only source of randomness you should address.

KNN is a model that takes each row of the test set, finds the k nearest training vectors, and classifies the row by majority vote; in the case of ties, the decision can be random. Note that set.seed(x) is R syntax; in Python the equivalent is to fix the NumPy seed with np.random.seed(x) so that the run is replicable.

Documentation states:

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
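As a minimal sketch of the idea above, the run below fixes both the split's random_state and the NumPy seed, then fits the model twice on the same data to check that the predictions and score come out identical. The data here is synthetic, standing in for the question's posdf (order and all_time_runs as features, position as target), since the original frame isn't available.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def run_once(X, y, seed=42):
    # Fix any NumPy-level randomness and the split itself.
    np.random.seed(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    return knn.predict(X_test), knn.score(X_test, y_test)

# Synthetic stand-in for posdf's two feature columns and target.
rng = np.random.RandomState(0)
X = rng.randint(0, 10000, size=(200, 2))
y = rng.choice(["NO BAT", "CAN BAT"], size=200)

pred1, score1 = run_once(X, y)
pred2, score2 = run_once(X, y)
assert (pred1 == pred2).all() and score1 == score2
```

If the two runs still differed after this, the variation would have to come from the data itself changing between runs (e.g. the frame being rebuilt in a different row order) rather than from the model.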

Celius Stingher