I am developing a spam filter using scikit-learn.
Here are the steps I follow:
Xdata = ["This is spam", "This is Ham", "This is spam again"]

1. Matrix = CountVectorizer(Xdata). Matrix will contain the count of each word in every document, so Matrix[i][j] gives me the count of word j in document i.
2. Matrix_idfX = TfidfVectorizer(Matrix). This will normalize the scores.
3. Matrix_idfX_Select = SelectKBest(Matrix_idfX, 500). This will reduce the matrix to the 500 best-scoring columns.
4. Multinomial.train(Matrix_idfX_Select)
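For reference, here is a minimal runnable sketch of the pipeline I have in mind. I am assuming chi2 as the scoring function for SelectKBest, using CountVectorizer followed by TfidfTransformer to match the counts-then-tf-idf steps above, and the labels y are made up for the toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

Xdata = ["This is spam", "This is Ham", "This is spam again"]
y = [1, 0, 1]  # made-up labels for the toy data: 1 = spam, 0 = ham

# Step 1: raw term counts; counts[i, j] = count of word j in document i
counts = CountVectorizer().fit_transform(Xdata)

# Step 2: tf-idf weighting on top of the counts
tfidf = TfidfTransformer().fit_transform(counts)

# Step 3: keep the k best-scoring columns (k=500 on a real corpus;
# capped here because the toy data has only a handful of features)
k = min(500, tfidf.shape[1])
selected = SelectKBest(chi2, k=k).fit_transform(tfidf, y)

# Step 4: train the multinomial Naive Bayes classifier
clf = MultinomialNB().fit(selected, y)
```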
Now my question: do I need to perform normalization or standardization in any of the above four steps? If yes, after which step and why?
Thanks