
I'm looking for a way to gauge the impact of the features I'm using in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understand that the .coef_ attribute holds the information I'm after (as also discussed in this thread: How to find the importance of the features for a logistic regression model?).

The first few lines of my matrix:

phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True

The first line is the header, followed by the data (which I convert to ints in my code with sklearn.preprocessing's LabelEncoder).

Now, when I do a

print(classifier.coef_)

I get

[[ 0.84768459 -0.56344453  0.00365928  0.21441586 -1.70290447 -0.18460676
   1.6167634   0.08556331  0.02152226 -0.05111953  0.07310608 -0.073653  ]]

which contains 12 columns/elements. This confuses me, since my data contains 13 feature columns (plus a 14th with the label; I separate the features from the labels later on in my code). Could it be that sklearn expects/assumes the first column to be an id and doesn't actually use its values? I cannot find any info on this.
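
For what it's worth, a quick shape check (a minimal sketch; X_train here stands for whatever array I end up passing to .fit()) makes the mismatch visible:

print(X_train.shape)           # expected: (n_samples, 13)
print(classifier.coef_.shape)  # (1, n_features) for a binary problem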

Any help here would be much appreciated!

Igor
  • The [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) says that `coef_` should be of shape (1, n_features) when the given problem is binary, so it looks like something is wrong. Can you post some code, so someone can have a look? – Stev Apr 13 '18 at 09:26
  • please provide [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – MaxU - stand with Ukraine Apr 13 '18 at 09:34
  • Please print your X_train.shape which you feed into the classifier.fit method. Looks like you accidentally dismissed a useful column. – Alexey Trofimov Apr 13 '18 at 10:53
  • Thanks @Alexey, this pointed me in the right direction. If you could briefly look at the post below and confirm my understanding, that'd be great! – Igor Apr 13 '18 at 12:21

1 Answer


I'm not sure how to edit my original question so that it would still make sense for future reference, so I'll post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]


df = pandas.DataFrame(matrix, columns=headers)
# fit one LabelEncoder per column, then encode every value as an int
d = defaultdict(LabelEncoder)
df = df.apply(lambda x: d[x.name].fit_transform(x))

# hold out the first 10% of rows as the test set (no shuffling)
testrows = []
trainrows = []
splitIndex = len(matrix) // 10
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))  # note: is_topic is already 0/1 after LabelEncoder, so this remapping is redundant
train_labels = numpy.array([labels.index(x) for x in train_labels])
# every column except the last one (the is_topic label) is a feature
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)

I think I may have found the source of the error (thanks @Alexey Trofimov for pointing me in the right direction). My code at first contained:

train_features = traindf.iloc[:,1:len(headers)-1]

This was copied from another script, where I did have ids as the first column of my matrix and therefore didn't want to include them. The len(headers)-1, if I understand correctly, is there to exclude the label column. Testing on real data, deleting the -1 yields a perfect F1 score, which makes sense: the classifier then sees the actual label as a feature and always predicts correctly... So I changed the line to

train_features = traindf.iloc[:,0:len(headers)-1]

as in the code snippet, and now get 13 columns (in X_train.shape, and consequently in classifier.coef_). I think this solved my issue, but am still not 100% convinced, so if someone could point out an error in this line of reasoning/my code above, I'd be grateful to hear about it.
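
As a further sanity check (a sketch, reusing traindf, headers and classifier from the snippet above), comparing the slice shapes and pairing each coefficient with its column name makes the reasoning concrete:

print(traindf.iloc[:,1:len(headers)-1].shape)  # (n, 12): first feature dropped by mistake
print(traindf.iloc[:,0:len(headers)-1].shape)  # (n, 13): all features, label excluded
print(traindf.iloc[:,0:len(headers)].shape)    # (n, 14): the label would leak into the features

# one coefficient per feature column
for name, coef in zip(headers[:-1], classifier.coef_[0]):
    print(name, coef)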

Igor
  • Is `is_topic` your label? If so, more conventional code would be something like `y = 'is_topic'; X = df.drop(['is_topic'], axis=1).columns`, then you refer to your labels as `df[y]` and your features as `df[X]` – Stev Apr 13 '18 at 12:25
  • Yep, that's indeed the label. Ok, thanks for the tip! I'll use that form for future attempts. – Igor Apr 13 '18 at 12:36
  • No problem, just trying to help :) Can I ask why you are doing your train/test split in such a way? Normally you would have a random element to the split and most people just use sklearn's `train_test_split`. It's easy to replicate if you don't want to use sklearn though. If you must stick with your method, then may I suggest something like `testrows=df.iloc[:splitIndex]` and `trainrows=df.iloc[splitIndex:]` to avoid looping through your dataframe? – Stev Apr 13 '18 at 12:41
  • Thanks for yet another useful tip :). The reason for doing it this way is that I'm wrapping this code in an x-fold cross-validation loop, where in each run I want to cover a different piece of the matrix as test set (as opposed to x runs, with randomised test sets each time). The looping through the dataframe is indeed a bit inefficient, but hasn't been a real issue (in terms of execution time) so far. Mostly running with max 5k data instances so far. – Igor Apr 13 '18 at 12:48
  • Ok, you seem like you know what you're doing :) I would perhaps have a look at using `cross_val_score` though, because you are basically just doing 10-fold cross validation without shuffling before selecting the folds. By default, shuffle is switched off, so the dataset remains ordered. To get roc_auc, you would do something like `cross_val_score(classifier, df[X], df[y], scoring='roc_auc', cv=StratifiedKFold(n_splits=10, shuffle=False))`. – Stev Apr 13 '18 at 13:00
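
A runnable sketch of that last suggestion (reusing the encoded df and the imports from the answer above; n_splits=10 happens to work on the toy data because each class has exactly ten rows):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# label column and feature columns, following Stev's convention
y = 'is_topic'
X = df.drop([y], axis=1).columns
scores = cross_val_score(LogisticRegression(), df[X], df[y],
                         scoring='roc_auc',
                         cv=StratifiedKFold(n_splits=10, shuffle=False))
print(scores, scores.mean())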