30

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

I am calling regression.fit(A, B), where A is a 2-D array holding the tfidf value of each feature in each document. Example format:

         "feature1"   "feature2"
"Doc1"    .44          .22
"Doc2"    .11          .6
"Doc3"    .22          .2

B are my target values for the data, which are just numbers 1-100 associated with each document:

"Doc1"    50
"Doc2"    11
"Doc3"    99

Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.

jeffrey

8 Answers

33

What I found to work was:

Here X is the DataFrame holding your independent variables:

coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)

The assumption you stated, that the order of regression.coef_ is the same as the column order in the TRAIN set, holds true in my experience (it works with the underlying data and also checks out against the correlations between X and y).
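One way to check the ordering yourself is to fit on synthetic data where the true coefficients are known. This is only a minimal sketch: it assumes a plain LinearRegression, and the column names and coefficient values are made up for illustration rather than taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: three named features with known true coefficients.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["feature1", "feature2", "feature3"])
true_coefs = {"feature1": 2.0, "feature2": -5.0, "feature3": 0.5}
y = sum(X[name] * value for name, value in true_coefs.items())

regression = LinearRegression().fit(X, y)

# coef_ has one entry per column of X, in column order,
# so the column names can be used directly as the index.
coefficients = pd.Series(regression.coef_, index=X.columns)
print(coefficients)
```

Because the target is built without noise, each recovered coefficient should land next to the name it was generated with, which is the behaviour the answer relies on.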

Kirsche
15

You can do that by creating a data frame:

cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
Pran Kumar Sarkar
  • regression.coef_ is now returned as a dataframe, so to do this use: cdf = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(regression.coef_))], axis = 1) – tim.newport Nov 04 '21 at 02:58
9
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":logistic.coef_[0]})
Snowde
8

I suppose you are working on a feature selection task. regression.coef_ does return the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.

In turn, I would recommend a tree-based model from sklearn, which can also be used for feature selection. For specifics, check out here.

Jake0x32
  • This is true as long as regression.coef_ returns coefficient values in the same order. Thanks. – jeffrey Nov 16 '14 at 00:55
  • The ExtraTreesClassifier is actually very interesting, but it seems there is no way to retrieve the actual features which it picked after the model has been fit? – jeffrey Nov 16 '14 at 01:17
  • @jeffrey Yes, but I always select features by using `clf.feature_importances_` to retrieve the importance ranking of the features. Intuitively, it is just like the coefficients of the linear model, isn't it? – Jake0x32 Nov 16 '14 at 01:41
  • Well, if you use a feature extraction method like CountVectorizer(), it has a method get_feature_names(). Then you can map get_feature_names() to .coef_ (I think they are in order, but I'm not sure). However, you cannot do this with the tree. – jeffrey Nov 16 '14 at 01:56
4

Coefficients and features in zip

print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))

Coefficients and features in DataFrame

pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})


Ankit Kumar Rajpoot
3

This is the easiest and most intuitive way:

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)

or the same but transposing index and columns

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Pablo Vilas
1

Suppose your training data X is a DataFrame named 'df_X'. You can then zip its columns with the coefficients into a dictionary and feed that into a pandas DataFrame to get the mapping:

pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
clieforce
0

Try putting them in a Series with the data's column names as the index:

coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
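Since the question asks for the highest-magnitude coefficients, sorting by absolute value may be closer to what is wanted than sorting by signed value. A minimal sketch, with hard-coded hypothetical coefficients standing in for model.coef_[0] and X.columns, and assuming a pandas version whose sort_values accepts a key argument:

```python
import pandas as pd

# Hypothetical coefficients, standing in for model.coef_[0] and X.columns.
coeffs = pd.Series([0.8, -2.5, 0.1],
                   index=["feature1", "feature2", "feature3"])

# Sort by absolute value so large negative coefficients rank high too.
by_magnitude = coeffs.sort_values(key=lambda s: s.abs(), ascending=False)
print(by_magnitude)

# Keep the names of the top-k features (k = 2 is arbitrary here).
top_features = by_magnitude.index[:2].tolist()
print(top_features)
```

Keep in mind that coefficient magnitudes are only comparable across features when the features are on similar scales.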
Brian Tompsett - 汤莱恩
Hanan Tabak