-1

I am trying to classify animal subjects with similar genotypes into 4 classes. The data are labeled and we know the genotype being assigned to each measured subject. I'm able to get 97% test accuracy using Random Forest classifier with no over/under fitting. However, my problem is that the genotypes are not fully distinct in reality and there might be some interrelation/co-variance between them. So, instead of identifying the distinct genotype for new instances, I would like to find the probability of belonging a new instance to any of the four classes (For example, 80% class 1, 10% class 2, 10% class 3)

I have just learned about the Gaussian Mixture Model (GMM) in Scikit-learn. So, my question is: first, if the GMM would be the appropriate method to solve this problem, and second, suggestions for other algorithms that can help.

  • You need to post some data sample, reproducible code if relevant, expected output for us to be able to debug this thing. From what you are asking, you should really see into hierarchical classification. Check if that matches your use case. – LazyCoder Jul 27 '19 at 20:03
  • Sure, thanks for your feedback. I'll post some codes soon. – Ali Nikooyan Jul 27 '19 at 20:27
  • I think I found the solution. It would be Multinomial Logistic Regression. – Ali Nikooyan Jul 28 '19 at 02:13

1 Answers1

0

I think I found the solution. It would be Multinomial Logistic Regression.

from sklearn.linear_model import LogisticRegression

# Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers

solvers = ['lbfgs', 'sag', 'saga','newton-cg']
clf = LogisticRegression(random_state=0, solver=solvers[0],
                     multi_class='multinomial').fit(X_train, y_train)
y_pred = clf.predict_proba(X_test) 

y_pred_proba = clf.predict_proba(X_test) 

clf.score(X_test, y_test)

#For example
# Probability of classes 0 to 3, for the third test instance:
print(y_pred_proba[3,:])
array([0.00094984, 0.65902225, 0.33647559, 0.00355232])
  • Practically all classifiers behave that way, not only logistic regression; you may find the discussion in [Predict classes or class probabilities?](https://stackoverflow.com/questions/51367755/predict-classes-or-class-probabilities/51423325#51423325) useful. And this is not considered multi-label classification (it's simply multi-class). – desertnaut Jul 28 '19 at 11:43
  • Thank you very much! both for sharing this amazing resource and also for correcting my terminology. – Ali Nikooyan Jul 28 '19 at 18:15