
I am currently training an SVC on a DataFrame with a lot of columns, using one of these columns as the target:


import pandas as pd
from sklearn import model_selection, svm
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

df.rename(columns={'Sequence': 'target'}, inplace=True)

# num and string_col are the lists of numeric and string column names.
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder())])

full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num),
    ('cat', cat_pipeline, string_col)
])

clf = make_pipeline(StandardScaler(), SVC())  # defined but not used below

X = df.iloc[:, :-1]
y = df.target
X = full_pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.80, test_size=0.20, random_state=101)
# degree is ignored by the linear kernel, so it has no effect here
linear = svm.SVC(kernel='linear', degree=3, C=1).fit(X_train, y_train)

When I inspect linear.coef_, I get a sparse matrix, which I then convert to a DataFrame:

c = linear.coef_.tocoo()  # COO format exposes row/column indices and values
matrix = pd.DataFrame({"node1": c.row, "node2": c.col, "poids": c.data})

id    node1  node2      poids
0         0     56  -0.010062
1         0     62  -0.000089
2         0     83   0.0090587
...
7576     14   1030  -0.000089

But I don't understand what node1 and node2 are. I searched around and found this thread: The dimension of dual_coef_ in sklearn.SVC.

It says the rows should be class 0 vs class 1, etc., but my node1 ranges from 0 to 14 and node2 from 0 to 1063. So I guess I have 15 classes (node1) and node2 indexes the features, but I only have 210 columns in this example (I am not running my program on all of the data; with everything I would get more columns). So what are the 1063 node2 values, and is there a way to find the weight of each of my columns? I have tried a lot of code that I found on the internet, but most of it assumed that my columns were my features. Here is an example of what I found that isn't working:

import matplotlib.pyplot as plt

def f_importances(coef, names):
    # Sort the (coefficient, name) pairs and plot them as a horizontal bar chart.
    imp = coef
    imp, names = zip(*sorted(zip(imp, names)))
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
    plt.show()

features_names = df.iloc[:, :-1].columns
f_importances(linear.coef_, features_names)

This code results in ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). But anyway, it looks like it wouldn't work: I tried replacing linear.coef_ with linear.coef_[0], with matrix, and with some other things, and I succeeded in removing the error, but it still doesn't work.
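For context, here is a minimal sketch of why the call fails, using the fitted model from above:

# linear.coef_ is a 2-D sparse matrix of shape (15, 1063), so zip/sorted
# end up comparing whole rows, which raises the ambiguity error.
# A single row can be densified to 1-D ...
row0 = linear.coef_[0].toarray().ravel()  # coefficients for one class pair
# ... but it still has 1063 entries, so the 210 original column names
# from df.iloc[:, :-1].columns cannot label it directly.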

Thanks a lot for any help, and sorry for my bad English; I am not a native speaker.

Edit: Thanks to Ben Reiniger's answer, I now know that OneHotEncoder splits each of my string columns into several indicator columns (one per category), which explains the 1063 columns in node2. In order to find the most important features of my DataFrame, I found with:

full_pipeline.named_transformers_['cat']['one_hot'].get_feature_names(string_col)

which original column is transformed into which ones, but only for the object-type columns. You need to prepend the numeric columns to this list in order to get all 1063 column names.

In order to get the right order I did some pandas manipulation and finally have the link between my original columns and the coefficients in my SVM, as sketched below.
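For anyone else, here is a minimal sketch of that pandas manipulation, assuming num and string_col are the column lists passed to the ColumnTransformer above:

import numpy as np

# Rebuild the full list of 1063 transformed column names. The 'num'
# transformer comes first in the ColumnTransformer, so the scaled numeric
# columns precede the one-hot encoded ones.
# (Newer scikit-learn versions use get_feature_names_out instead.)
ohe = full_pipeline.named_transformers_['cat']['one_hot']
feature_names = list(num) + list(ohe.get_feature_names(string_col))

# Label each coefficient in the DataFrame built earlier with its column name.
matrix['feature'] = np.array(feature_names)[matrix['node2']]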

Thanks !

  • Are you accounting for the preprocessing (`full_pipeline.fit_transform`) in your feature count? What's the shape of `X_train`? – Ben Reiniger Apr 28 '21 at 15:11
  • X_train: <25x1063 sparse matrix with 4800 stored elements in Compressed Sparse Row format>. But it's not the error that's really bothering me; I could find a fix, and I have kind of found another method. But in order to know whether what I am doing is right, I would like to know what "n_features" is and why my shapes look like that. Thanks anyway! – Solal Peiffer-Smadja Apr 29 '21 at 06:44

1 Answer


n_features refers to the number of features, i.e. the number of columns in your training dataset. In your case, that's the number of columns after the preprocessing full_pipeline; presumably some of your 210 original features get one-hot encoded or otherwise expanded into many columns.
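A quick way to confirm this with the objects from the question:

# The transformed training matrix and coef_ share the same n_features.
print(X_train.shape)       # (25, 1063)  -> n_features = 1063
print(linear.coef_.shape)  # (15, 1063)  -> 15 class pairs x 1063 features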

n_classes is indeed the number of target classes, but note that the first dimension of coef_ is n_classes * (n_classes - 1) / 2. You have 6 classes, so 6 * (6 - 1) / 2 = 15 rows in coef_. This is the number of pairs of classes: the SVM automatically performs one-vs-one classification, so you get a linear function for each pair of classes.
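To make the pairing concrete, here is a sketch; the ordering matches scikit-learn's documented one-vs-one order:

from itertools import combinations

# Rows of coef_ are ordered 0-vs-1, 0-vs-2, ..., 0-vs-5, 1-vs-2, ..., 4-vs-5,
# so row i of coef_ belongs to class pair pairs[i].
pairs = list(combinations(linear.classes_, 2))
print(len(pairs))  # 15 when there are 6 classes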

For determining feature importances, because of that one-vs-one behavior, you'll need to decide how to aggregate each feature's coefficients across all the class pairs. A columnwise sum of the coef matrix seems reasonable, but there are always caveats with feature importances.
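For instance, one possible version of that aggregation (using absolute values rather than a raw sum, so coefficients with opposite signs across pairs don't cancel; this is a choice, not the only option):

import numpy as np

# Sum each feature's absolute coefficients over all 15 class pairs.
importance = np.abs(linear.coef_.toarray()).sum(axis=0)  # shape (1063,)
top10 = np.argsort(importance)[::-1][:10]                # strongest features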

Your link to the question about dual_coef_ is a bit of a red herring, because here you're analyzing the primal formulation, not the dual.

Finally, having only 25 examples in your X_train when you have many more features is probably not ideal for drawing conclusions.

Ben Reiniger
  • Oh OK, thanks! I didn't think about the expansion into many columns; I will need to check how this works in order to find the weight of each of my 210 original features! I will try the sum of the coef matrix and check whether people are doing it like that on the internet. And yeah, you are right about the 25 samples; it's just to make my SVM run, and then I will use more samples. Thanks a lot for your help! – Solal Peiffer-Smadja Apr 30 '21 at 07:51