
I'm trying to apply the bag-of-words algorithm to a certain column of my dataframe, which has 6723 rows. But when I apply TF-IDF to that column, the returned vocabulary is too big: 8357 words, to be precise.

from sklearn.feature_extraction.text import TfidfVectorizer

# ...

statements = X_train[:, 0]

tf_idf = TfidfVectorizer()
tf_idf_matrix = tf_idf.fit_transform(statements).toarray()
vocabulary = tf_idf.vocabulary_

print(len(vocabulary))            # 8357
print(tf_idf.stop_words_)         # set()
print(len(tf_idf.stop_words_))    # 0

After reading the documentation, I found that we can pass the max_df parameter, which is supposed to ignore words whose frequency is higher than the given threshold, so I did this in order to ignore words with a frequency above 50%:

# ...

tf_idf = TfidfVectorizer(max_df=0.5)
tf_idf_matrix = tf_idf.fit_transform(statements).toarray()
vocabulary = tf_idf.vocabulary_

print(len(vocabulary))            # 8356
print(tf_idf.stop_words_)         # {'the'}
print(len(tf_idf.stop_words_))    # 1

So, as you can see, the results were not very good, and I think I'm doing something wrong, because there are other words with high frequencies that weren't removed, such as: to, in, of, etc. Am I doing something wrong? How can I fix it?
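For reference, here is a minimal, runnable sketch on a made-up toy corpus (not the actual data) showing how TfidfVectorizer populates stop_words_ when max_df is set: the threshold is compared against document frequency, i.e. the fraction of documents that contain a term, not the term's total number of occurrences.

# Toy example (assumed corpus, not the question's data): max_df filters on
# document frequency, the fraction of documents a term appears in.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew away",
    "a fish swam in circles",
]

# 'the' occurs in 3 of 4 documents (75% > 50%), so it is moved to stop_words_;
# 'cat' occurs in 2 of 4 documents (exactly 50%, not strictly greater), so it stays.
tf_idf = TfidfVectorizer(max_df=0.5)
tf_idf.fit(corpus)

print(tf_idf.stop_words_)            # {'the'}
print('cat' in tf_idf.vocabulary_)   # True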

flpn
  • No, `max_df` doesn't remove words with a frequency higher than the given threshold. `max_df=0.5` will remove words that occur in more than 50% of the unique documents. See [this answer for details](https://stackoverflow.com/a/35615151/3374996). – Vivek Kumar Sep 17 '18 at 13:29
  • And what do you mean by `frequencies`? – Vivek Kumar Sep 17 '18 at 13:30
  • I want to ignore all words that occur in more than 50% of ALL documents – flpn Sep 17 '18 at 14:24
  • Did you pre-process your training set? Unless it's a particular dataset, 8357 different words sounds like a lot. Did you try removing stop words, lowercasing, tokenizing and/or lemmatizing the text? – Igor OA Sep 19 '18 at 13:25
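Following the preprocessing suggested in the last comment, a minimal sketch of how the vectorizer could be configured; the statements variable comes from the question, while the specific stop_words and min_df values are illustrative assumptions, not the original code (lemmatization would need an external tool such as NLTK or spaCy and is not shown here).

# Sketch: shrink the vocabulary with built-in preprocessing options
# (thresholds are illustrative assumptions).
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(
    lowercase=True,          # default: normalise case before tokenising
    stop_words="english",    # built-in English stop-word list ('to', 'in', 'of', ...)
    max_df=0.5,              # drop terms appearing in more than 50% of documents
    min_df=5,                # drop terms appearing in fewer than 5 documents
)
# tf_idf_matrix = tf_idf.fit_transform(statements)
# print(len(tf_idf.vocabulary_))   # vocabulary should now be noticeably smaller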

0 Answers