
I was trying to figure out a way of finding a correlation between continuous variables and a non-binary categorical target label. The only thing I thought of was fitting the labels with Multinomial Logistic Regression and then extracting the coefficients for every class.

I also thought of one-hot encoding the classes, but when I used the result with pd.corr() the output was full of NaNs, which makes sense in hindsight.

Is there a better way to do this?

Imagine the following dataset:

| Feature 1 | Feature 2 | Feature 3 | Class |
|-----------|-----------|-----------|-------|
| 10        | 20        | 30        | A     |
| 20        | 40        | 60        | A     |
| 30        | 60        | 90        | B     |
| 40        | 80        | 120       | B     |
| 50        | 100       | 150       | C     |
| 60        | 120       | 180       | C     |
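For what it's worth, the one-hot route can be made to work. The sketch below (using the toy data above) encodes the classes with `pd.get_dummies`, forces a numeric dtype so `corr()` does not skip the indicator columns, and then reads off the point-biserial correlation of each feature with each class indicator:

```python
import pandas as pd

# toy dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

# one-hot encode the classes as floats so corr() treats them as numeric
dummies = pd.get_dummies(df["Class"], prefix="Class", dtype=float)
features = df.drop(columns=["Class"])

# correlation of every feature with every class indicator
corr = pd.concat([features, dummies], axis=1).corr()
print(corr.loc[dummies.columns, features.columns])
```

This gives one correlation per (class, feature) pair, the same shape as Example A below, though the values can be negative and do not sum to 1 per row.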

EDIT:

Two example outputs could be the following:

Example A

| Class | Feature 1 | Feature 2 | Feature 3 |
|-------|-----------|-----------|-----------|
| A     | 0.33      | 0.67      | 0         |
| B     | 0.15      | 0.35      | 0.5       |
| C     | 0.1       | 0.2       | 0.7       |

Example B

|       | Feature 1 | Feature 2 | Feature 3 |
|-------|-----------|-----------|-----------|
| Class | 0.25      | 0.15      | 0.6       |

I am achieving something like Example A, but I am not sure if the result is robust.
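If a single score per feature (the shape of Example B) is enough, one standard option is the ANOVA F-test, available as `sklearn.feature_selection.f_classif`. The sketch below uses the toy data from the question; the normalization so the scores sum to 1, as in Example B, is my own addition, not part of `f_classif`:

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# toy dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

X = df.drop(columns=["Class"])
y = df["Class"]

# one ANOVA F statistic (and p-value) per feature
f_scores, p_values = f_classif(X, y)
print(dict(zip(X.columns, f_scores)))

# optional: rescale so the scores sum to 1, as in Example B
normalized = f_scores / f_scores.sum()
print(dict(zip(X.columns, normalized)))
```

Note that in this toy set every feature is an exact multiple of Feature 1, so all three F scores come out identical; on real data they would differ.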

EDIT 2:

This is what I was doing so far:

    # import libraries
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    
    # create the dataframe
    df = pd.DataFrame({
        "Class": ["High", "High", "Medium", "Medium", "Low", "Low"],
        "Feature 1": [10, 20, 30, 40, 50, 60],
        "Feature 2": [20, 40, 60, 80, 100, 120],
        "Feature 3": [30, 90, 120, 150, 180, 210]
    })
    
    # define the model
    lr = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=10000, random_state=1)
    # fit the model
    lr.fit(df.drop(columns=["Class"]), df["Class"])
    # get importances; the rows of coef_ follow the sorted order of
    # lr.classes_, NOT the order the labels appear in the dataframe
    importances = lr.coef_
    
    # for each class, print the coefficient of each feature
    for label, classImportances in zip(lr.classes_, importances):
        print("Class:", label, "\n")
        for j, featureImportance in enumerate(classImportances):
            print("Feature " + str(j + 1) + ":", featureImportance)
        print("\n")

Output:

    Class: High 
    
    Feature 1: -0.07064219154435081
    Feature 2: -0.14128438308870161
    Feature 3: -0.21196087185231213
    
    
    Class: Low 
    
    Feature 1: 0.07064617336110231
    Feature 2: 0.14129234672220461
    Feature 3: 0.21196040102827662
    
    
    Class: Medium 
    
    Feature 1: -3.981997391518259e-06
    Feature 2: -7.963994783036519e-06
    Feature 3: 4.732779961757555e-07
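A model-free cross-check for these coefficients is mutual information, which also yields one non-negative dependence score per feature (the shape of Example B). This sketch reuses the dataframe from the snippet above with `sklearn.feature_selection.mutual_info_classif`:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# same dataframe as in the logistic-regression snippet
df = pd.DataFrame({
    "Class": ["High", "High", "Medium", "Medium", "Low", "Low"],
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 90, 120, 150, 180, 210]
})

X = df.drop(columns=["Class"])
y = df["Class"]

# one mutual-information estimate per feature; random_state pins the
# small noise the nearest-neighbour estimator adds to continuous columns
mi = mutual_info_classif(X, y, random_state=1)
print(dict(zip(X.columns, mi)))
```

With only six samples the estimates are very rough, but on a real dataset this gives a single, direction-free relevance score per feature without fitting a classifier.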
  • What's the desired output? – U13-Forward Sep 07 '21 at 04:05
  • @U12-Forward a correlation between the features and the classes. The way I am doing this with the Multinomial Logistic Regression, I get different coefficients for all the different labels. I try to find a result as if Class was a continuous variable. – zoump Sep 07 '21 at 04:08
  • Please post an expected example dataframe, for the example data you posted, – U13-Forward Sep 07 '21 at 04:09
  • I think this is a stats question in first instance rather than a programming question. One classical way to test a relationship between categorical variables and continuous variables is ANOVA. There is a [discussion in CrossValidated](https://stats.stackexchange.com/questions/190984/anova-vs-multiple-linear-regression-why-is-anova-so-commonly-used-in-experiment) that might be helpful. – TMBailey Sep 26 '21 at 15:16
  • Similar: https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model – Ammar N. Abbas Dec 14 '21 at 07:45

0 Answers