
I use TfidfVectorizer like this:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()

But when inspecting vectorizer.vocabulary_ I've noticed that it learns pure number features:

[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4), ...]

I don't want this. How can I prevent it?

Martin Thoma


You can define the token_pattern when initializing the vectorizer. The default one is r'(?u)\b\w\w+\b' (the (?u) part just turns on the re.UNICODE flag). You can fiddle with that until you get what you need. Note that the pattern should be a raw string (or have its backslashes escaped); otherwise Python interprets \b as a backspace character rather than a word boundary.

Something like:

vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')
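As a quick sanity check (a minimal sketch with a made-up toy corpus, and without the stop_words/min_df settings from the question), you can fit a vectorizer with that pattern and confirm that pure-number tokens never enter the vocabulary, while tokens containing at least one letter do:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: pure numbers, a mixed alphanumeric token, and plain words.
docs = ["call 000 emergency", "the year 2020 was long", "abc123 code mix"]

# Requires each token to contain at least one ASCII letter.
vectorizer = TfidfVectorizer(token_pattern=r'\b\w*[a-zA-Z]\w*\b')
vectorizer.fit(docs)

print(sorted(vectorizer.vocabulary_))
# "000" and "2020" are gone; "abc123" survives because it contains letters.
```

One side effect worth knowing: unlike the default pattern, which requires two or more characters, this pattern also admits single-letter tokens such as "a", so you may want to combine it with a stop word list.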

Another option (if the fact that numbers appear in your samples matters) is to mask all the numbers before vectorizing:

import re
sample = re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)

This way all numbers map to the same slot in your vectorizer's vocabulary, and you don't ignore them completely either.

Iulius Curt