This question explains how to add your own words to the built-in English stop words of CountVectorizer. I'm interested in seeing the effect on a classifier of eliminating all numbers as tokens.
ENGLISH_STOP_WORDS is stored as a frozenset, so unless there's a method I don't know about, my question boils down to whether it's possible to add an arbitrary number representation to a frozenset.
My feeling is that it's not possible: whatever collection you pass has to be finite, so it can never cover every possible number.
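To make the frozenset constraint concrete, here is a minimal sketch in plain Python. The three-word `base_stop_words` set is a hypothetical stand-in for ENGLISH_STOP_WORDS, and the numeric strings are arbitrary examples. It shows that a frozenset cannot be modified in place, but that a union still produces a new, extended (and still finite) frozenset.

```python
# Hypothetical stand-in for ENGLISH_STOP_WORDS, which is also a frozenset.
base_stop_words = frozenset({"the", "and", "of"})

# A frozenset is immutable: it has no add() method at all.
try:
    base_stop_words.add("2015")
    added_in_place = True
except AttributeError:
    added_in_place = False

# A union does not modify the original; it returns a new frozenset.
extended = base_stop_words | {"2015", "42"}

print(added_in_place)      # False
print(sorted(extended))
```

So extending the set is possible, but only with numbers you can enumerate ahead of time.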
I suppose one way to accomplish the same thing would be to loop through the test corpus and collect every word for which word.isdigit() is true into a set/list that I can then union with ENGLISH_STOP_WORDS (see previous answer), but I'd rather be lazy and pass something simpler to the stop_words parameter.
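For reference, the loop-and-union workaround described above might look roughly like the sketch below. The two-sentence `corpus` is made up for illustration, and the sketch assumes a scikit-learn version that exposes ENGLISH_STOP_WORDS from `sklearn.feature_extraction.text`. The stop words are passed as a list, since that is the documented collection type for stop_words.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical corpus, for illustration only.
corpus = [
    "the classifier saw 2015 tokens and 365 documents",
    "about 911 numeric tokens confused the classifier",
]

# Tokenize the same way the vectorizer will, with no stop-word filtering yet.
analyze = CountVectorizer().build_analyzer()

# Collect every token made up entirely of digits.
numeric_tokens = {tok for doc in corpus for tok in analyze(doc) if tok.isdigit()}

# Union with the built-in English stop words and pass the result as a list.
stop_words = list(ENGLISH_STOP_WORDS.union(numeric_tokens))

vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(corpus)

print(sorted(vectorizer.vocabulary_))
```

This only removes numbers that actually occur in the corpus the set was built from, which is exactly the limitation I'd like to avoid.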