
I am facing a multiclass classification problem with 4 classes, one of which dominates the others. I use a KNN classification model, and most instances get classified as the majority class. Setting the `weights='distance'` parameter improved things, but not as much as I expected. I know that adjusting the per-class classification thresholds can improve results on the classes with fewer instances, but I don't know how to do it (I sketch my rough idea after the results below). My code is this:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from scipy.spatial.distance import braycurtis


    df_X = pd.read_csv('df_data.csv')
    df_Y = pd.read_csv('df_Class.csv')

    X_train, X_test, Y_train, Y_test = train_test_split(df_X, df_Y, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=5, metric=braycurtis, weights='distance')
    knn.fit(X_train, Y_train.values.ravel())  # ravel so y is passed as a 1-D array
    Y_pred = knn.predict(X_test)

    acc_score = accuracy_score(Y_test, Y_pred)
    print("Acierto de KNN en la partición de test:", acc_score)
    print(metrics.classification_report(Y_test,Y_pred))

    m_confusion = confusion_matrix(Y_test, Y_pred)
    print(m_confusion)

and my results are these:

                   precision    recall  f1-score   support

               1       0.64      0.39      0.48       244
               2       0.77      0.49      0.60       371
               3       0.56      0.95      0.71       626
               4       0.64      0.34      0.44       408

        accuracy                           0.61      1649
       macro avg       0.65      0.54      0.56      1649
    weighted avg       0.64      0.61      0.58      1649

    [[ 94   4 126  20]
     [ 21 182 127  41]
     [ 10   6 592  18]
     [ 23  43 204 138]]
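
My rough idea, which I'm not sure is correct, is to use `predict_proba` and rescale each class's probability column by a per-class weight before taking the argmax. The inverse-frequency weights in this sketch are just a placeholder I made up, not tuned values:

    # Rough idea, not verified: rescale predicted probabilities with
    # per-class weights before the argmax. np.unique returns the labels
    # sorted, so counts line up with knn.classes_ / the columns of proba.
    proba = knn.predict_proba(X_test)                   # shape (n_samples, 4)
    _, counts = np.unique(Y_train.values.ravel(), return_counts=True)
    class_weights = counts.sum() / counts               # placeholder: inverse class frequency
    Y_pred_weighted = knn.classes_[np.argmax(proba * class_weights, axis=1)]
    print(classification_report(Y_test, Y_pred_weighted))

Is this a sensible way to adjust the thresholds, and how should the per-class weights actually be chosen?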

Thank you very much!
