scikit learn: Problems creating customized CountVectorizer and ChiSquare

Question

I have the following code (based on the samples here), but it is not working:

[...]
def my_analyzer(s):
    return s.split()
my_vectorizer = CountVectorizer(analyzer=my_analyzer)
X_train = my_vectorizer.fit_transform(traindata)

ch2 = SelectKBest(chi2,k=1)
X_train = ch2.fit_transform(X_train,Y_train)
[...]

The following error is given when calling fit_transform:

AttributeError: 'function' object has no attribute 'analyze'

According to the documentation, CountVectorizer should be created like this: vectorizer = CountVectorizer(tokenizer=my_tokenizer). However, if I do that, I get the following error: "got an unexpected keyword argument 'tokenizer'".

My current scikit-learn version is 0.10.

score 3 · Accepted Answer · answered Apr 29 '12 at 16:14

You're looking at the documentation for 0.11 (to be released soon), where the vectorizer has been overhauled. Check the documentation for 0.10, where there is no tokenizer argument and the analyzer should be an object implementing an analyze method:

class MyAnalyzer(object):
    @staticmethod
    def analyze(s):
        return s.split()

v = CountVectorizer(analyzer=MyAnalyzer())

http://scikit-learn.org/dev is the documentation for the upcoming release (which may change at any time), while http://scikit-learn.org/stable has the documentation for the current stable version.
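To see why the error message names `analyze`, here is a toy stand-in in plain Python. It is not scikit-learn's code: it only assumes, as the traceback suggests, that the 0.10 vectorizer calls `self.analyzer.analyze(doc)` internally. `ToyVectorizer` and `count_tokens` are hypothetical names invented for this illustration.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a 0.10-style vectorizer (not scikit-learn code)."""

    def __init__(self, analyzer):
        self.analyzer = analyzer

    def count_tokens(self, doc):
        # Assumption: the analyzer is an object exposing an analyze() method.
        counts = {}
        for token in self.analyzer.analyze(doc):
            counts[token] = counts.get(token, 0) + 1
        return counts


def my_analyzer(s):
    # A bare function, as in the question: it has no .analyze attribute.
    return s.split()


class MyAnalyzer:
    # An object with an analyze method, as in the answer.
    @staticmethod
    def analyze(s):
        return s.split()


# Passing the bare function fails as soon as .analyze is looked up.
try:
    ToyVectorizer(my_analyzer).count_tokens("a b a")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Passing an instance of the wrapper class works.
counts = ToyVectorizer(MyAnalyzer()).count_tokens("a b a")

print(error_message)
print(counts)
```

The wrapper class adds nothing beyond the method name the caller expects, which is why a `@staticmethod` is enough.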

Fred Foo


Thanks! By the way, should I also convert the sparse matrix to an array? Like this: `ch2.fit_transform(X_train.toarray(), Y_train)` Otherwise I get a non-subscriptable error. – D T Apr 29 '12 at 23:19
@DT: that should never be necessary for chi² feature selection, it's designed to handle sparse matrices. What's the next step in your pipeline? – Fred Foo Apr 30 '12 at 11:32
Hmm...Strange, chi square requires 2 arrays (X and Y) so I thought I had to convert the sparse matrix to an array...My full code (and a new problem with chisquare features) is [here](http://stackoverflow.com/questions/10378601/scikit-learn-desired-amount-of-best-features-k-not-selected), could you please take a look? – D T Apr 30 '12 at 11:41
I meant, if I call ch2.fit_transform with a sparse matrix, I get a "coo-matrix object is unsubscriptable" error, but it works if I convert the matrix to an array using toarray(). – D T Apr 30 '12 at 11:51

@DT: looks like a bug, I'll look into it later today. For now, convert your coo-matrix to CSR format with `.tocsr()` instead of `.toarray()`, that will preserve the sparsity. – Fred Foo Apr 30 '12 at 12:05
Thanks again for your help! Using the CSR format works now, and it selects the features I request without problems! Still, it's strange, since a numpy array of size [n_samples, n_features] should be ok according to the documentation... If you want to post your last comment in my other post [here](http://stackoverflow.com/questions/10378601/scikit-learn-desired-amount-of-best-features-k-not-selected), I can accept it as the answer. – D T Apr 30 '12 at 12:31

@DT: yes, a Numpy array should be ok, but in practice you don't want to densify a sparse matrix. I've had workstations freeze when attempting that on large sparse matrices. I've just pushed a patch upstream so chi² will work with COO matrices in 0.11. – Fred Foo Apr 30 '12 at 12:36
Just an update: I have been trying with more instances, and sometimes it seems that chi-square does not give me the number of features I request with k... – D T Apr 30 '12 at 15:59

@DT: if it returns more than `k` features, then I just fixed that in the dev version. Otherwise, consider [filing a bug report](https://github.com/scikit-learn/scikit-learn/issues/new). – Fred Foo Apr 30 '12 at 16:40
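For readers on a later release, the pipeline discussed in this thread can be sketched end to end as follows. This is a minimal sketch, assuming a scikit-learn version in which `analyzer` accepts a plain callable; the toy documents and labels are invented for illustration. The explicit `.tocsr()` call mirrors the workaround from the comments and should be harmless where the vectorizer already returns CSR.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def my_analyzer(s):
    # Whitespace tokenizer, as in the question.
    return s.split()


# Made-up training data: two positive and two negative documents.
traindata = [
    "good great film",
    "great acting good plot",
    "bad boring film",
    "boring plot bad acting",
]
Y_train = [1, 1, 0, 0]

# Assumption: this release accepts a callable for `analyzer`.
my_vectorizer = CountVectorizer(analyzer=my_analyzer)
X_counts = my_vectorizer.fit_transform(traindata)  # SciPy sparse matrix

# Keep the data sparse: convert to CSR rather than densifying with .toarray().
X_csr = X_counts.tocsr()

# Select the k features with the highest chi-squared scores.
ch2 = SelectKBest(chi2, k=2)
X_selected = ch2.fit_transform(X_csr, Y_train)

print(X_counts.shape)
print(X_selected.shape)
```

Checking `X_selected.shape[1]` against `k` is a quick way to spot the "wrong number of features" behaviour mentioned in the last comments.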
