
I was trying to figure out a way of finding a correlation between continuous variables and a non-binary categorical target label. The only thing I thought of was fitting the labels with Multinomial Logistic Regression and then extracting the coefficients for every class.

I also thought of one-hot encoding the classes, but when I used the result with pd.corr() the output was full of NaNs, which makes sense in hindsight.

Is there a better way to do this?

Imagine the following dataset:

| Feature 1 | Feature 2 | Feature 3 | Class |
|-----------|-----------|-----------|-------|
| 10        | 20        | 30        | A     |
| 20        | 40        | 60        | A     |
| 30        | 60        | 90        | B     |
| 40        | 80        | 120       | B     |
| 50        | 100       | 150       | C     |
| 60        | 120       | 180       | C     |
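For what it's worth, the one-hot route can be made to work. The sketch below (using the toy data above) encodes the classes with `pd.get_dummies`, forces a numeric dtype so `corr()` does not skip the indicator columns, and then reads off the point-biserial correlation of each feature with each class indicator:

```python
import pandas as pd

# toy dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

# one-hot encode the classes as floats so corr() treats them as numeric
dummies = pd.get_dummies(df["Class"], prefix="Class", dtype=float)
features = df.drop(columns=["Class"])

# correlation of every feature with every class indicator
corr = pd.concat([features, dummies], axis=1).corr()
print(corr.loc[dummies.columns, features.columns])
```

This gives one correlation per (class, feature) pair, the same shape as Example A below, though the values can be negative and do not sum to 1 per row.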

EDIT:

Two example outputs could be the following:

Example A

| Class | Feature 1 | Feature 2 | Feature 3 |
|-------|-----------|-----------|-----------|
| A     | 0.33      | 0.67      | 0         |
| B     | 0.15      | 0.35      | 0.5       |
| C     | 0.1       | 0.2       | 0.7       |

Example B

|       | Feature 1 | Feature 2 | Feature 3 |
|-------|-----------|-----------|-----------|
| Class | 0.25      | 0.15      | 0.6       |

I am achieving something like Example A, but I am not sure if the result is robust.
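If a single score per feature (the shape of Example B) is enough, one standard option is the ANOVA F-test, available as `sklearn.feature_selection.f_classif`. The sketch below uses the toy data from the question; the normalization so the scores sum to 1, as in Example B, is my own addition, not part of `f_classif`:

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# toy dataset from the question
df = pd.DataFrame({
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 60, 90, 120, 150, 180],
    "Class": ["A", "A", "B", "B", "C", "C"],
})

X = df.drop(columns=["Class"])
y = df["Class"]

# one ANOVA F statistic (and p-value) per feature
f_scores, p_values = f_classif(X, y)
print(dict(zip(X.columns, f_scores)))

# optional: rescale so the scores sum to 1, as in Example B
normalized = f_scores / f_scores.sum()
print(dict(zip(X.columns, normalized)))
```

Note that in this toy set every feature is an exact multiple of Feature 1, so all three F scores come out identical; on real data they would differ.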

EDIT 2:

This is what I was doing so far:

    # import libraries
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    
    # create the dataframe
    df = pd.DataFrame({
        "Class": ["High", "High", "Medium", "Medium", "Low", "Low"],
        "Feature 1": [10, 20, 30, 40, 50, 60],
        "Feature 2": [20, 40, 60, 80, 100, 120],
        "Feature 3": [30, 90, 120, 150, 180, 210]
    })
    
    # define the model
    lr = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=10000, random_state=1)
    # fit the model
    lr.fit(df.drop(columns=["Class"]), df["Class"])
    # get importances; the rows of coef_ follow the sorted order of
    # lr.classes_, NOT the order the labels appear in the dataframe
    importances = lr.coef_
    
    # for each class, print the coefficient of each feature
    for label, classImportances in zip(lr.classes_, importances):
        print("Class:", label, "\n")
        for j, featureImportance in enumerate(classImportances):
            print("Feature " + str(j + 1) + ":", featureImportance)
        print("\n")

Output:

    Class: High 
    
    Feature 1: -0.07064219154435081
    Feature 2: -0.14128438308870161
    Feature 3: -0.21196087185231213
    
    
    Class: Low 
    
    Feature 1: 0.07064617336110231
    Feature 2: 0.14129234672220461
    Feature 3: 0.21196040102827662
    
    
    Class: Medium 
    
    Feature 1: -3.981997391518259e-06
    Feature 2: -7.963994783036519e-06
    Feature 3: 4.732779961757555e-07
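A model-free cross-check for these coefficients is mutual information, which also yields one non-negative dependence score per feature (the shape of Example B). This sketch reuses the dataframe from the snippet above with `sklearn.feature_selection.mutual_info_classif`:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# same dataframe as in the logistic-regression snippet
df = pd.DataFrame({
    "Class": ["High", "High", "Medium", "Medium", "Low", "Low"],
    "Feature 1": [10, 20, 30, 40, 50, 60],
    "Feature 2": [20, 40, 60, 80, 100, 120],
    "Feature 3": [30, 90, 120, 150, 180, 210]
})

X = df.drop(columns=["Class"])
y = df["Class"]

# one mutual-information estimate per feature; random_state pins the
# small noise the nearest-neighbour estimator adds to continuous columns
mi = mutual_info_classif(X, y, random_state=1)
print(dict(zip(X.columns, mi)))
```

With only six samples the estimates are very rough, but on a real dataset this gives a single, direction-free relevance score per feature without fitting a classifier.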
  • What's the desired output? – U13-Forward Sep 07 '21 at 04:05
  • @U12-Forward a correlation between the features and the classes. The way I am doing this with the Multinomial Logistic Regression, I get different coefficients for all the different labels. I try to find a result as if Class was a continuous variable. – zoump Sep 07 '21 at 04:08
  • Please post an expected example dataframe, for the example data you posted, – U13-Forward Sep 07 '21 at 04:09
  • I think this is a stats question in first instance rather than a programming question. One classical way to test a relationship between categorical variables and continuous variables is ANOVA. There is a [discussion in CrossValidated](https://stats.stackexchange.com/questions/190984/anova-vs-multiple-linear-regression-why-is-anova-so-commonly-used-in-experiment) that might be helpful. – TMBailey Sep 26 '21 at 15:16
  • Similar: https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model – Ammar N. Abbas Dec 14 '21 at 07:45

0 Answers