I am facing a multiclass classification problem with 4 classes, one of which dominates the others. I am using a KNN classifier, and most instances are being classified as the majority class. I tried the weights = 'distance' parameter, which helped, but not as much as I expected. I know that adjusting the per-class classification thresholds can improve performance on the classes with fewer instances, but I don't know how to do it properly (my rough idea is sketched after the results below). My code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.spatial.distance import braycurtis

df_X = pd.read_csv('df_data.csv')
df_Y = pd.read_csv('df_Class.csv')

X_train, X_test, Y_train, Y_test = train_test_split(df_X, df_Y, random_state=42)

# Distance-weighted KNN with the Bray-Curtis metric
knn = KNeighborsClassifier(n_neighbors=5, metric=braycurtis, weights='distance')
knn.fit(X_train, Y_train.values.ravel())  # ravel to pass a 1-D target

Y_pred = knn.predict(X_test)

acc_score = accuracy_score(Y_test, Y_pred)
print("KNN accuracy on the test split:", acc_score)
print(classification_report(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))
and my results are these:
              precision    recall  f1-score   support

           1       0.64      0.39      0.48       244
           2       0.77      0.49      0.60       371
           3       0.56      0.95      0.71       626
           4       0.64      0.34      0.44       408

    accuracy                           0.61      1649
   macro avg       0.65      0.54      0.56      1649
weighted avg       0.64      0.61      0.58      1649
[[ 94   4 126  20]
 [ 21 182 127  41]
 [ 10   6 592  18]
 [ 23  43 204 138]]
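
For reference, this is the kind of per-class threshold adjustment I have in mind: scale each class's predicted probability by a per-class weight and take the argmax, so the minority classes need less probability mass to win. This is just a rough sketch reusing knn and X_test from the code above; the weight values are hypothetical placeholders that would have to be tuned, e.g. on a validation split.

proba = knn.predict_proba(X_test)  # shape (n_samples, n_classes), columns ordered as knn.classes_
class_weights = np.array([1.5, 1.3, 0.6, 1.4])  # hypothetical values: lower weight for majority class 3
Y_pred_adj = knn.classes_[np.argmax(proba * class_weights, axis=1)]
print(classification_report(Y_test, Y_pred_adj))

But I don't know if this is the right way to do it, or how the per-class thresholds should be chosen.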
Thank you very much!