
What might be some key factors for increasing or stabilizing the accuracy score (so that it does NOT vary significantly across runs) of this basic KNN model on the iris data?

Attempt

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris() 
X, y = iris.data[:, :], iris.target

Xtrain, Xtest, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)

knn = neighbors.KNeighborsClassifier(n_neighbors=4)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Sample Accuracy Scores

0.9736842105263158
0.9473684210526315
1.0
0.9210526315789473

Classification Report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.79      1.00      0.88        11
           2       1.00      0.80      0.89        15

    accuracy                           0.92        38
   macro avg       0.93      0.93      0.92        38
weighted avg       0.94      0.92      0.92        38

Sample Confusion Matrix

[[12  0  0]
 [ 0 11  0]
 [ 0  3 12]]
Emma
  • What do you mean by stabilising the Accuracy? Do you want to find a good "k" value for this problem? – phoxis Jul 05 '19 at 02:32
  • You mean across the multiple runs? If yes, then why would you like to do that? – phoxis Jul 05 '19 at 02:33
  • Possible duplicate of [What is "random-state" in sklearn.model\_selection.train\_test\_split example?](https://stackoverflow.com/questions/49147774/what-is-random-state-in-sklearn-model-selection-train-test-split-example) – Venkatachalam Jul 05 '19 at 06:34

2 Answers


There are only 3 classes in the iris dataset: Iris-Setosa, Iris-Virginica, and Iris-Versicolor.

Use this code. Stratifying the split and fixing `random_state` makes the result reproducible; this gives me 97.78% accuracy:

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris() 
X, y = iris.data[:, :], iris.target
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0, train_size=0.7)

scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Rheatey Bash

I would recommend tuning the k value for k-NN. Since iris is a small, nicely balanced dataset, I would do the following:

For every value of `k` in range [2 to 10] (say)
  Perform an n-times k-fold cross-validation (say n=20 and k=4)
    Store the accuracy values (or any other metric)

Plot the mean and variance of the scores and select the value of `k` that performs best. The main goal of cross-validation is to estimate the test error, and based on that you select the final model. There will be some variance, but it should be small, say less than 0.03 or so; that depends on the dataset and the number of folds you take. One good approach is, for each value of `k`, to make a boxplot of all the 20x4 accuracy values. Select the value of `k` for which the lower quartile intersects the upper quartile, or in simple words, for which there is not too much change in the accuracy (or other metric values).
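The loop above can be sketched in scikit-learn directly: `RepeatedStratifiedKFold` gives the n-times k-fold scheme, and a pipeline keeps the scaler fit on training folds only (the k range and n=20, 4 folds are just the suggested values):

```python
from sklearn import datasets
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# n-times k-fold CV: n=20 repeats of 4 folds -> 80 scores per value of k
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=0)

for k in range(2, 11):
    # Scaling inside the pipeline avoids leaking test-fold statistics
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"k={k:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The 80 scores per `k` are exactly what you would feed into the boxplots described above.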

Once you have selected the value of k this way, use it to build the final model on the entire training dataset. This model can then be used to predict new data.

On the other hand, for larger datasets, make a separate test partition (as you did here), and then tune the k value using only the training set (with cross-validation; forget about the test set). After selecting an appropriate k, train the algorithm using only the training set. Finally, use the test set to report the final score. Never take any decision based on the test set.
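One way to sketch this workflow is with `GridSearchCV`, which runs the cross-validation over `k` on the training set only; the held-out test set is touched exactly once at the end (the split sizes are arbitrary):

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out a test set once; never use it during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": range(2, 11)},
                    cv=4)
grid.fit(X_train, y_train)  # tuning sees only the training set

print("best k:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))  # reported once, at the end
```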

Yet another method is a train, validation, test partition. Train models using different values of k on the training set, then predict on the validation partition and record the scores. Select the best k based on this validation score. Next, use the train (or train+validation) set to build the final model with the selected k. Finally, take out the test set and report the final score. Again, never use the test set anywhere else.
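A minimal sketch of that three-way partition (the 60/20/20 split ratios here are an arbitrary choice):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 60% train, 20% validation, 20% test (all stratified)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, stratify=y_tmp, test_size=0.25, random_state=0)

# Pick k using the validation set only
best_k, best_acc = None, -1.0
for k in range(2, 11):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit on train+validation with the chosen k, then report the test score once
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))
```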

These are general methods applicable to any machine learning or statistical learning methods.

An important thing to note: when you perform the partitioning (train/test, or for cross-validation), use stratified sampling so that the class ratios stay the same in each partition.
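For example, passing `stratify=y` to `train_test_split` preserves the class ratios in both partitions:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target  # 50 samples per class

# stratify=y keeps the 50/50/50 class balance in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

print("train counts:", np.bincount(y_tr))  # roughly equal counts per class
print("test counts: ", np.bincount(y_te))
```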

Read more about cross-validation. In scikit-learn it is easy to do. If using R, you can use the caret package.

The main thing to remember is that the goal is to train a function which generalises, i.e. performs well on new data, and not only on the existing data.

phoxis
  • Hi phoxis, Why should he use multiple values of k because k =3 is already fixed. It is classifier not a clustering problem. – Rheatey Bash Jul 05 '19 at 03:03
  • @RheateyBash but in that case the variance does not matter. If the variance is high for the value of 4, then it might be the case that the value 4 is not a good value for this problem. – phoxis Jul 05 '19 at 03:05
  • Let’s suppose that there are three classes of iris dataset and if I select k anything other than 3 how would you justify correct classification. In fact it would become clustering problem instead of classification. Desired result would be never achieved. – Rheatey Bash Jul 05 '19 at 03:09
  • I do not understand how you are relating the number of classes and the number of neighbours? As per the question the task is building a classifier model based on kNN. The number of classes shouldn't have anything to do with the number of neighbours. – phoxis Jul 05 '19 at 03:11
  • Great. It's just now many existing datapoints do you see around yourself (new datapoint) based on which (maybe majority voting) decide who you are. – phoxis Jul 05 '19 at 03:14