I am developing a spam filter using scikit-learn.
Here are the steps I follow:
Xdata = ["This is spam", "This is Ham", "This is spam again"]

1. Matrix = CountVectorizer(Xdata). Matrix will contain the count of each word in every document, so Matrix[i][j] gives me the count of word j in document i.
2. Matrix_idfX = TfidfVectorizer(Matrix). This will normalize the scores.
3. Matrix_idfX_Select = SelectKBest(Matrix_idfX, 500). This will reduce the matrix to the 500 best-scoring columns.
4. Multinomial.train(Matrix_idfX_Select)
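For reference, here is a minimal runnable sketch of the pipeline I have in mind. I am assuming chi2 as the scoring function for SelectKBest, using CountVectorizer followed by TfidfTransformer to match the counts-then-tf-idf steps above, and the labels y are made up for the toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]  # made-up labels for the toy data: 1 = spam, 0 = ham

# Step 1: raw term counts; counts[i, j] = count of word j in document i
counts = CountVectorizer().fit_transform(Xdata)

# Step 2: tf-idf weighting on top of the counts
tfidf = TfidfTransformer().fit_transform(counts)

# Step 3: keep the k best-scoring columns (k=500 on a real corpus;
# capped here because the toy data has only a handful of features)
k = min(500, tfidf.shape[1])
selected = SelectKBest(chi2, k=k).fit_transform(tfidf, y)

# Step 4: train the multinomial Naive Bayes classifier
clf = MultinomialNB().fit(selected, y)
```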
Now my question: do I need to perform normalization or standardization in any of the above four steps? If yes, after which step and why?
Thanks