I am currently training an SVC on a dataframe with many columns, one of which is the target:
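import pandas as pd
from sklearn import model_selection, svm
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

df.rename(columns={'Sequence': 'target'}, inplace=True)
# num and string_col hold the names of the numeric and string columns
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder())])
full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num),
    ('cat', cat_pipeline, string_col)
])
clf = make_pipeline(StandardScaler(), SVC())
X = df.iloc[:, :-1]  # assumes the target is the last column
y = df.target
X = full_pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=101)
# degree is ignored by the linear kernel
linear = svm.SVC(kernel='linear', degree=3, C=1).fit(X_train, y_train)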
When I type linear.coef_, I get a sparse matrix, which I then convert to a dataframe:
c = linear.coef_.tocoo()
matrix = pd.DataFrame({"node1": c.row, "node2": c.col, "poids": c.data})
| id | node1 | node2 | poids |
|---|---|---|---|
| 0 | 0 | 56 | -0.010062 |
| 1 | 0 | 62 | -0.000089 |
| 2 | 0 | 83 | 0.0090587 |
| ... | ... | ... | ... |
| 7576 | 14 | 1030 | -0.000089 |
But I don't understand what node1 and node2 are. I searched around and found this thread: "The dimension of dual_coef_ in sklearn.SVC".
According to it, the rows should be class 0 vs. class 1, etc., but my node1 ranges from 0 to 14 and node2 from 0 to 1063. So I guess I have 15 classes (node1) and node2 are the features, but I only have 210 columns in this example (I am not running my program on all of the data; with the full data I would get more columns). So what are the 1063 node2 values, and is there a way to find the weight of each of my columns? I have tried a lot of code found on the internet, but most of it assumes that my columns are my features. Here is one example that does not work:
import matplotlib.pyplot as plt

def f_importances(coef, names):
    imp = coef
    imp, names = zip(*sorted(zip(imp, names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
    plt.show()

features_names = df.iloc[:, :-1].columns
f_importances(linear.coef_, features_names)
This code raises ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). In any case, it looks like it wouldn't work: I tried replacing linear.coef_ with linear.coef_[0], matrix, and a few other things, and I succeeded in removing the error, but it still doesn't work.
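As far as I can tell, the error comes from sorted() trying to compare whole rows of the 2D sparse coef_ instead of single numbers. A minimal sketch of a variant that works on one row at a time is below; `top_importances`, `toy_coef` and `toy_names` are hypothetical names on made-up data, and it assumes `names` has exactly one entry per transformed column, not per original dataframe column.

```python
import numpy as np
from scipy import sparse


def top_importances(coef_row, names, top=10):
    """Return the `top` (name, weight) pairs with the largest |weight| in one coef_ row."""
    if sparse.issparse(coef_row):
        coef_row = coef_row.toarray()  # sparse (1, n_features) -> dense
    weights = np.asarray(coef_row).ravel()  # flatten to shape (n_features,)
    if len(weights) != len(names):
        raise ValueError("names must have one entry per transformed column")
    # argsort is ascending, so the last `top` indices have the largest |weight|
    order = np.argsort(np.abs(weights))[-top:]
    return [(names[i], float(weights[i])) for i in order]


# Toy stand-in for linear.coef_: 2 rows, 4 transformed columns (made-up values)
toy_coef = sparse.csr_matrix([[0.5, -2.0, 0.0, 1.0],
                              [0.1, 0.0, 0.3, -0.2]])
toy_names = ["a", "b", "c", "d"]

pairs = top_importances(toy_coef[0], toy_names, top=3)
print(pairs)
```

The resulting pairs can be unzipped with `labels, weights = zip(*pairs)` and passed to plt.barh and plt.yticks as in the function above.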
Thanks a lot for any help, and sorry for my bad English; I am not a native speaker.
Edit: Thanks to Ben Reiniger's answer, I now know that OneHotEncoder splits each of my columns into several new ones (I have no real idea why), which explains the 1063 columns in node2. To find the most important features of my dataframe, I used:
clf['full_pipeline'].transformers_[1][1]['one_hot'].get_feature_names(string_col)
to see which column is transformed into which ones, but this only covers the object-type (string) columns. You need to add the remaining (numeric) columns to this list in order to get all 1063 columns.
To get the right order, I did some pandas manipulation and finally have the link between my original columns and the coefficients of my SVM.
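A sketch of what such a manipulation could look like, not necessarily the exact one I used. The `names`, `coef` and `original_cols` values are made up, and `source_column` is a hypothetical helper. As I understand the scikit-learn documentation, each row of coef_ is one pairwise (one-vs-one) classifier, so there are n_classes * (n_classes - 1) / 2 rows; 15 rows would then mean 6 classes rather than 15. Summing absolute weights per original column is only one possible way to aggregate:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: transformed column names and a small dense coef_.
# With a sparse coef_, call linear.coef_.toarray() first.
names = ["num__size", "cat__color_blue", "cat__color_red",
         "cat__shape_round", "cat__shape_square"]
coef = np.array([[0.2, -1.0, 1.0, 0.5, -0.5],
                 [0.1, 0.3, -0.3, 0.0, 0.0]])
original_cols = ["size", "color", "shape"]


def source_column(name, originals):
    """Map e.g. 'cat__color_red' back to 'color' (longest matching original name)."""
    stripped = name.split("__", 1)[1]  # drop the 'num__' / 'cat__' prefix
    matches = [c for c in originals
               if stripped == c or stripped.startswith(c + "_")]
    return max(matches, key=len)  # raises ValueError if nothing matches


# Total |weight| of each transformed column over all pairwise classifiers
per_feature = np.abs(coef).sum(axis=0)

table = pd.DataFrame({
    "feature": names,
    "column": [source_column(n, original_cols) for n in names],
    "weight": per_feature,
})

# Sum the weights of all transformed columns that came from the same original column
per_column = table.groupby("column")["weight"].sum().sort_values(ascending=False)
print(per_column)
```

Matching on the longest original name is there so that a column called, say, color_code would not be confused with color.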
Thanks!