I was trying to find a way to measure the correlation between continuous variables and a non-binary categorical target label. The only approach I thought of was fitting the data with multinomial logistic regression and then extracting the coefficients for each class.
I also tried one-hot encoding the classes, but the result was full of NaNs when I used it with pd.corr() - understandably so.
Is there a better way to do this?
Imagine the following dataset:
Feature 1 | Feature 2 | Feature 3 | Class |
---|---|---|---|
10 | 20 | 30 | A |
20 | 40 | 60 | A |
30 | 60 | 90 | B |
40 | 80 | 120 | B |
50 | 100 | 150 | C |
60 | 120 | 180 | C |
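A standard continuous-vs-categorical analogue of correlation is the correlation ratio (eta): the square root of the share of a feature's variance that lies between the classes, ranging from 0 (no association) to 1 (the class fully determines the value). A minimal sketch on the example dataset above (the helper name `correlation_ratio` is my own, not a library function):

```python
import numpy as np
import pandas as pd

def correlation_ratio(values: pd.Series, classes: pd.Series) -> float:
    """Correlation ratio (eta): sqrt(between-class variance / total variance)."""
    grand_mean = values.mean()
    # between-class sum of squares: how far each class mean sits from the grand mean
    ss_between = sum(
        len(group) * (group.mean() - grand_mean) ** 2
        for _, group in values.groupby(classes)
    )
    # total sum of squares
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0

# the example dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

for col in ["Feature 1", "Feature 2", "Feature 3"]:
    print(col, correlation_ratio(df[col], df["Class"]))
```

Because the three features here are exact multiples of each other, they all get the same eta; on real data the values would differ per feature.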
EDIT:
Two example outputs could be the following:
Example A
Class | Feature 1 | Feature 2 | Feature 3 |
---|---|---|---|
A | 0.33 | 0.67 | 0 |
B | 0.15 | 0.35 | 0.5 |
C | 0.1 | 0.2 | 0.7 |
Example B
  | Feature 1 | Feature 2 | Feature 3 |
---|---|---|---|
Class | 0.25 | 0.15 | 0.6 |
I am achieving something like Example A, but I am not sure if the result is robust.
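If the goal is Example B (a single association score per feature), a univariate ANOVA F-test gives exactly that; for instance `sklearn.feature_selection.f_classif`, which tests whether each feature's class means differ. A sketch on the example data (note the F scores are not normalized to sum to 1 like the row in Example B):

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# the example dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

X = df[["Feature 1", "Feature 2", "Feature 3"]]
y = df["Class"]

# one ANOVA F statistic (and p-value) per feature
f_scores, p_values = f_classif(X, y)
for name, f, p in zip(X.columns, f_scores, p_values):
    print(f"{name}: F={f:.2f}, p={p:.3f}")
```

A larger F means the class label explains more of that feature's variance relative to the within-class noise.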
EDIT 2:
This is what I have been doing so far:
# import libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
# create the dataframe
df = pd.DataFrame({
"Class" : ["High", "High", "Medium", "Medium", "Low", "Low"],
"Feature 1" : [10, 20, 30, 40, 50, 60],
"Feature 2" : [20, 40, 60, 80, 100, 120],
"Feature 3" : [30, 90, 120, 150, 180, 210]
})
# define the model
lr = LogisticRegression(multi_class = "multinomial", solver = "lbfgs", max_iter = 10000, random_state = 1)
# fit the model (drop() already returns a new DataFrame, so copy() is unnecessary)
lr.fit(df.drop(columns = ["Class"]), df["Class"])
# get importances
importances = lr.coef_
# for each class - note that the rows of coef_ follow lr.classes_
# (sorted label order), NOT the order in which labels appear in the data,
# so iterating df["Class"].unique() would mislabel the rows
for i, label in enumerate(lr.classes_):
    # print the importance of each feature
    print("Class:", label, "\n")
    for j, featureImportance in enumerate(importances[i]):
        print("Feature " + str(j + 1) + ":", featureImportance)
    print("\n")
Output:
Class: High
Feature 1: -0.07064219154435081
Feature 2: -0.14128438308870161
Feature 3: -0.21196087185231213
Class: Low
Feature 1: 0.07064617336110231
Feature 2: 0.14129234672220461
Feature 3: 0.21196040102827662
Class: Medium
Feature 1: -3.981997391518259e-06
Feature 2: -7.963994783036519e-06
Feature 3: 4.732779961757555e-07
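Since raw coefficients depend on feature scale, one way to turn the snippet above into an Example A-style table is to standardize the features first, then normalize the absolute coefficients so each class's row sums to 1. A sketch under those assumptions (the normalization choice is mine, not a standard API; rows are labeled with `lr.classes_`, the order `coef_` actually uses):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# the dataset from the snippet above
df = pd.DataFrame({
    "Class": ["High", "High", "Medium", "Medium", "Low", "Low"],
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 90, 120, 150, 180, 210],
})

features = df.columns.drop("Class")
# standardize so coefficient magnitudes are comparable across features
X = StandardScaler().fit_transform(df[features])

lr = LogisticRegression(solver = "lbfgs", max_iter = 10000, random_state = 1)
lr.fit(X, df["Class"])

# normalize absolute coefficients so each class's row sums to 1
abs_coef = np.abs(lr.coef_)
table = pd.DataFrame(abs_coef / abs_coef.sum(axis = 1, keepdims = True),
                     index = lr.classes_, columns = features)
print(table)
```

This yields one row per class and one column per feature, with each row summing to 1, i.e. the shape of Example A. Whether these relative magnitudes are a robust notion of "correlation" is debatable, since logistic regression coefficients are conditional on the other features being in the model.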