Adding numbers to stop_words to scikit-learn's CountVectorizer

Question

This question explains how to add your own words to the built-in English stop words of CountVectorizer. I'm interested in seeing the effects on a classifier of eliminating any numbers as tokens.

ENGLISH_STOP_WORDS is stored as a frozenset, so I guess my question boils down (unless there's a method I don't know of) to whether it's possible to add an arbitrary number representation to a frozenset.

My feeling is that it's not possible, since any list you pass has to be finite, while the set of possible numbers is not.

I suppose one way to accomplish the same thing would be to loop through the test corpus and pop words where word.isdigit() is true to a set/list that I can then union with ENGLISH_STOP_WORDS (see previous answer), but I'd rather be lazy and pass something simpler to the stop_words parameter.
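
For reference, the loop-and-union idea described above might look like the sketch below. It is only an illustration under stated assumptions: BASE_STOP_WORDS is a shortened stand-in for ENGLISH_STOP_WORDS so the snippet runs without scikit-learn, and numeric_tokens is a hypothetical helper name, not part of any library.

```python
# Stand-in for sklearn.feature_extraction.text.ENGLISH_STOP_WORDS (also a frozenset);
# shortened here so the sketch runs without scikit-learn installed.
BASE_STOP_WORDS = frozenset({"a", "is", "the", "this"})

def numeric_tokens(corpus):
    """Collect every whitespace-separated token that consists only of digits."""
    found = set()
    for doc in corpus:
        found.update(tok for tok in doc.split() if tok.isdigit())
    return found

corpus = ['This is sentence.', '12 dogs eat candy', '1 2 3 45']

# A frozenset cannot be modified in place, but union() returns a new frozenset.
stop_words = BASE_STOP_WORDS.union(numeric_tokens(corpus))
print(sorted(numeric_tokens(corpus)))  # ['1', '12', '2', '3', '45']
```

The result could then be handed to the vectorizer as a list, e.g. stop_words=list(stop_words). Note that this only covers numbers that actually occur in the corpus that was scanned, which is the limitation the question is worried about.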

Jim K. · Accepted Answer · 2019-01-23T17:53:50.067

Instead of extending the stopword list, you can implement this as a custom preprocessor for the CountVectorizer. Below is a simple version of this shown in bpython.

>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))
>>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> cv.vocabulary_
{u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}

Precompiling the regexp would likely give some speedup over a large number of samples.
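
A minimal sketch of the precompiled variant, assuming a pattern of one digit followed by any run of digits or dots. The names NUMBER_RE and replace_numbers are made up for illustration.

```python
import re

# Compiled once at import time; matches a digit followed by any run of digits or dots.
# The pattern is an assumption chosen for this sketch.
NUMBER_RE = re.compile(r'\d[\d.]*')

def replace_numbers(text):
    """Lowercase the document, then replace each number with the placeholder NUM."""
    return NUMBER_RE.sub('NUM', text.lower())

print(replace_numbers('12 dogs eat candy'))    # NUM dogs eat candy
print(replace_numbers('Version 3.14 is out'))  # version NUM is out
```

The function can be passed directly as CountVectorizer(preprocessor=replace_numbers). Since Python's re module already caches recently compiled patterns, the gain may turn out to be small; the named function is mostly easier to read and reuse than the lambda.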

alemol · Answer 2 · 2017-12-01T16:23:50.900

import re
from sklearn.feature_extraction.text import CountVectorizer

list_of_texts = ['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45']

def no_number_preprocessor(text):
    # The preprocessor receives each raw document as one string, not a token list.
    r = re.sub(r'\d+', 'NUM', text.lower())
    # This alternative just removes numbers:
    # r = re.sub(r'\d+', '', text.lower())
    return r

for t in list_of_texts:
    no_num_t = no_number_preprocessor(t)
    print(no_num_t)

cv = CountVectorizer(input='content', preprocessor=no_number_preprocessor)
dtm = cv.fit_transform(list_of_texts)
cv_vocab = cv.get_feature_names_out().tolist()  # use get_feature_names() on scikit-learn < 1.0

print(cv_vocab)

Output:

this is sentence.
this is a second sentence.
NUM dogs eat candy
NUM NUM NUM NUM
['NUM', 'candy', 'dogs', 'eat', 'is', 'second', 'sentence', 'this']
