Let's say we have 10 independent variables x1, x2, x3, ..., x10, all categorical with the same levels 0, 1, 2 (e.g., 0 = no color, 1 = red, 2 = green), and two dependent (response) variables (e.g., y1 = waist size in m and y2 = pant length in m). How do we determine which independent variables (x1, x2, x3, ..., x10) drive the dependent variables (y1 and y2)?
An example of the data is as follows:
| x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | waist_size (y1) | pant_length (y2) |
|----|----|----|----|----|----|----|----|----|-----|-----------------|------------------|
| 0 | 1 | 2 | 1 | 0 | 0 | 2 | 1 | 0 | 2 | 0.36 | 0.84 |
| 0 | 1 | 1 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0.84 | 1.23 |
| 1 | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1.92 | 3.86 |
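In case it helps to reproduce the setup, a toy frame with the same layout can be built like this (the column names waist_size and pant_length are just the ones I use in my CSV, and the values are random, not my real data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100  # number of rows; purely illustrative

# ten categorical predictors with levels 0, 1, 2
data = {f'x{i}': rng.integers(0, 3, size=n) for i in range(1, 11)}
# two continuous responses in metres
data['waist_size'] = rng.uniform(0.3, 2.0, size=n)
data['pant_length'] = rng.uniform(0.8, 4.0, size=n)

pd.DataFrame(data).to_csv('data.csv', index=False)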
I tried PLS regression in Python; here is my code:
import pandas as pd
import numpy as np
from sklearn.cross_decomposition import PLSRegression

df = pd.read_csv('data.csv', header=0)

# predictors: every column except the two response columns
X = df[[c for c in df.columns if c not in ['waist_size', 'pant_length']]].values
# responses
Y = df[['waist_size', 'pant_length']].values

pls = PLSRegression(n_components=8)
pls.fit(X, Y)

coef = pls.coef_
# rank the predictors by the absolute size of their PLS coefficients
sorted_index = np.argsort(np.abs(coef))
The actual result from this approach is a numpy array (with one pair per row, as far as I can tell) that looks like this:
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[0, 1],
[1, 0]
.....
How should I interpret this output?
And is there a way to calculate direct correlations and feature importances for this kind of problem?
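For context, this is roughly the kind of thing I had in mind by "feature importance", though I am not sure it is appropriate for categorical predictors with two responses (RandomForestRegressor is just one example of a model that exposes importances, not something I am committed to):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('data.csv', header=0)
feature_cols = [c for c in df.columns if c not in ['waist_size', 'pant_length']]

# one-hot encode the 0/1/2 levels so the model does not treat them as ordered numbers
X = pd.get_dummies(df[feature_cols].astype('category'))
Y = df[['waist_size', 'pant_length']].values

# random forests handle multi-output regression and expose feature_importances_
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, Y)

# sum the dummy-column importances back to the original x1..x10 variables
imp = pd.Series(rf.feature_importances_, index=X.columns)
per_variable = imp.groupby(lambda name: name.rsplit('_', 1)[0]).sum()
print(per_variable.sort_values(ascending=False))

Is something like this statistically sound here, or is there a better way to get at which of x1..x10 drive y1 and y2?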