I'm trying to apply the bag-of-words model to a column of my dataframe, which has 6723 rows. But when I apply tf-idf to that column, the vocabulary it returns is very large: 8357 words, to be precise.
# ...
from sklearn.feature_extraction.text import TfidfVectorizer

statements = X_train[:, 0]  # the text column
tf_idf = TfidfVectorizer()
tf_idf_matrix = tf_idf.fit_transform(statements).toarray()  # document-term matrix
vocabulary = tf_idf.vocabulary_  # dict mapping term -> column index
print(len(vocabulary))  # 8357
print(tf_idf.stop_words_)  # set()
print(len(tf_idf.stop_words_))  # 0
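For context, this is how I looked at which terms are most common in the fitted vocabulary (a quick sketch of my own, not part of my original pipeline; it relies on the idf_ attribute, where a lower idf means the term occurs in more documents, and get_feature_names_out assumes scikit-learn >= 1.0):

import numpy as np

# sort terms by ascending idf: the most common terms come first
terms = tf_idf.get_feature_names_out()
most_common = terms[np.argsort(tf_idf.idf_)[:10]]
print(most_common)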
After reading the documentation I found that there is a max_df parameter, which is supposed to ignore terms with a document frequency higher than the given threshold, so I did this in order to ignore words that appear in more than 50% of the documents:
#...
tf_idf = TfidfVectorizer(max_df=0.5)
tf_idf_matrix = tf_idf.fit_transform(statements).toarray()  # refit with the threshold
vocabulary = tf_idf.vocabulary_  # re-read the new vocabulary
print(len(vocabulary))  # 8356
print(tf_idf.stop_words_)  # {'the'}
print(len(tf_idf.stop_words_))  # 1
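To understand what max_df actually filters, I also put together a tiny toy corpus (my own sketch; if max_df measures document frequency, then every term appearing in more than half of the documents should end up in stop_words_):

docs = ["the cat sat", "the cat ran", "the dog slept"]
toy = TfidfVectorizer(max_df=0.5)
toy.fit(docs)
# 'the' occurs in 3/3 documents and 'cat' in 2/3, so if max_df is about
# the fraction of documents containing a term, I would expect
# stop_words_ to come out as {'the', 'cat'}
print(toy.stop_words_)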
So, as you can see, only a single word was removed, and I think I'm doing something wrong, because there are other high-frequency words that weren't removed, such as 'to', 'in', 'of', etc. Am I doing something wrong? How can I fix this?
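For reference, this is how I planned to check the actual document frequencies of those words in my corpus (a sketch using CountVectorizer with binary=True so that each document counts a term at most once; I haven't confirmed this matches the statistic max_df uses internally):

from sklearn.feature_extraction.text import CountVectorizer

# binary=True: a term is counted at most once per document,
# so the column sums are document frequencies
cv = CountVectorizer(binary=True)
doc_term = cv.fit_transform(statements)
n_docs = doc_term.shape[0]
for word in ['the', 'to', 'in', 'of']:
    idx = cv.vocabulary_.get(word)
    if idx is not None:
        df = doc_term[:, idx].sum() / n_docs
        print(f"{word!r} appears in {df:.1%} of the documents")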