8

I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 feature, and from these 227 features I want to select the top 10 ones.

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())      
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)    
ch2 = SelectKBest(chi2, k=10)
print X_train.shape
X_train = ch2.fit_transform(X_train, Y_train)
print X_train.shape

The results are the following.

(80, 227)
(80, 14)

They are similar if I set k equal to 100.

(80, 227)
(80, 227)

Why does this happen?

*EDIT: A full output example , now without clipping, where I request 30 and got 32 instead:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]
get support with vocabulary :
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...

Another example without clipping, where I request 10 and get 11 instead:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True  True  True False False  True False False False False  True False
 False False  True False False False  True False  True False  True  True
 False False False False  True False False False]
get support with vocabulary :
[ 0  1  2  5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...
Gyan Veda
  • 6,309
  • 11
  • 41
  • 66
D T
  • 677
  • 12
  • 23

1 Answers1

5

Have you checked what is returned from the get_support() function (ch2 should have this member function)? This returns the indices being selected among the best k.

My conjecture is that there are ties due to the data clipping that you're doing (or due to repeated feature vectors, if your feature vectors are categorical and are likely to have repeats), and that the scikits function returns all entries that are tied for the top k spots. The extra example where you set k = 100 casts some doubt on this conjecture, but it's worth a look.

See what get_support() returns, and check what X_train looks like on those indices, see if clipping results in a lot of feature overlap, creating ties in the chi^2 p-value ranks that SelectKBest is using.

If this turns out to be the case, you should file a bug / issue with scikits.learn, because currently their documentation does not say what SelectKBest will do in the event of ties. Clearly it can't just take some of the tied indices and not others, but users should at least be warned that ties could result in unexpected feature dimensionality reduction.

ely
  • 74,674
  • 34
  • 147
  • 228
  • Thanks for your reply. I tried removing the clipping but still does not work as expected... I have edited my question with a full output (including get_support), could you please take a look at it? – D T Apr 30 '12 at 06:09
  • Tie breaking is indeed the problem. I'll discuss with the other developers whether this should be changed in the code or in the documentation. – Fred Foo Apr 30 '12 at 12:48
  • @larsmans I grabbed the last commit from github (1550-g258f193) and now works flawlessly. Thanks for such a quick support, that was amazing. However, I still cannot understand what the problem was ("tie-breaking" between X and Y?) though... – D T May 01 '12 at 01:42
  • 1
    @DT: when you selected, say, 100 features and the 100th and 101st feature had the same p-value for the chi² test, `SelectKBest` would return the 101st as well, and possibly also the 102nd etc. It now returns always exactly 100 features. – Fred Foo May 01 '12 at 09:46