Let's say we have 10 independent variables x1, x2, x3, ..., x10, all categorical with the same levels 0, 1, 2 (e.g., 0 = no color, 1 = red, 2 = green), and two dependent (response) variables (e.g., y1 = pant length in m and y2 = waist size in m). How do we determine which independent variables (x1, x2, x3, ..., x10) drive the dependent variables (y1 and y2)?

An example of the data is as follows:

| x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | size(y1) | length(y2) |
|----|----|----|----|----|----|----|----|----|-----|----------|------------|
|  0 |  1 |  2 |  1 |  0 |  0 |  2 |  1 |  0 |   2 |     0.36 |       0.84 |
|  0 |  1 |  1 |  0 |  2 |  1 |  0 |  2 |  0 |   1 |     0.84 |       1.23 |
|  1 |  2 |  0 |  1 |  0 |  1 |  0 |  1 |  0 |   2 |     1.92 |       3.86 |

I tried PLS regression in Python, and here is my code:

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

df = pd.read_csv('data.csv', header=0)

# Response columns; every other column is treated as a predictor.
targets = ['waist_size', 'pant_length']
X = df.drop(columns=targets).to_numpy()  # DataFrame.as_matrix was removed from pandas
Y = df[targets].to_numpy()

pls = PLSRegression(n_components=8)
pls.fit(X, Y)
coef = pls.coef_
# Note: with no axis argument, argsort sorts along the last axis of the 2-D array.
sorted_index = np.argsort(np.abs(coef))

The actual result of this approach is a NumPy array covering all the rows in the dataset, as follows:

[1, 0],
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[0, 1],
[1, 0]
.....

How should I interpret this?

Also, is there a way to calculate direct correlations and feature importance in this kind of problem?

  • related: https://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python – Ray Tayek Apr 24 '19 at 02:08
  • Ray, my question was whether correlation can be calculated between three variables, i.e., the correlation between x1 and the combination of (y1, y2) – akhil reddy Apr 24 '19 at 03:37
  • https://en.wikipedia.org/wiki/Multiple_correlation – Ray Tayek Apr 24 '19 at 07:35
  • https://stackoverflow.com/questions/42128462/in-python-how-to-do-correlation-between-multiple-columns-more-than-2-variables – Ray Tayek Apr 24 '19 at 07:37
  • The Wikipedia link is about the correlation between one dependent variable (y1) and multiple independent variables (x1, x2, ..., xn). The Stack Overflow link is about the correlation between multiple variables. My question is specifically about how to calculate the correlation between (y1, y2) and (x1, x2, ..., xn) – akhil reddy Apr 25 '19 at 08:09
  • Based on my research, one-way MANOVA might be of help – akhil reddy Apr 25 '19 at 08:13
  • maybe like this: https://stats.stackexchange.com/questions/4517/regression-with-multiple-dependent-variables – Ray Tayek Apr 26 '19 at 04:24

1 Answer

You can use Principal Component Analysis (PCA) for this (in my opinion, PCA suits your goal better than PLS). As for your question: you get two values per row because you are fitting PLS2, not PLS1, since Y is a matrix with two columns (one per response) rather than a single vector. If you want to stay with PLS, look at pls.x_loadings_.
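To see where rows such as [1, 0] come from, here is a minimal NumPy sketch. The coefficient matrix is made up for illustration (it is not fitted from the question's data) and assumes one row per feature and one column per response. Check `coef.shape` on your own model, because the orientation of `coef_` depends on the scikit-learn version.

```python
import numpy as np

# Hypothetical coefficients: rows are features, columns are the two responses.
# These numbers are invented for illustration only.
coef = np.array([
    [0.10, 0.40],
    [0.90, 0.20],
    [0.30, 0.80],
])

# With no axis argument, argsort works along the last axis, so each row only
# compares the two responses for one feature. Every row is [0, 1] or [1, 0].
per_row = np.argsort(np.abs(coef))

# Sorting along axis 0 ranks the features within each response column;
# reversing the result puts the largest absolute coefficient first.
ranking = np.argsort(np.abs(coef), axis=0)[::-1]

print(per_row)
print(ranking)
```

Read this way, the array in the question is not a feature ranking: each row only says which of the two responses has the smaller absolute coefficient for that feature. Column `j` of `ranking` lists feature indices from most to least influential for response `j`.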

https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html
