EDIT: This is the question I was ultimately trying to ask: Understanding min_df and max_df in scikit CountVectorizer
I was reading the documentation for scikit-learn's CountVectorizer and noticed that when discussing max_df, we are concerned with document frequency for tokens:
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
But when we consider max_features, we are interested in term frequency:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
I am confused: if we use max_df and set it to, say, 10, aren't we saying, "Ignore any token that shows up more than 10 times"?
And if we set max_features to 100, aren't we saying, "Only use the 100 tokens that have the highest number of appearances across the corpus"?
If I've got this right, then what's the difference between the wording "term frequency" and "document frequency"?
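To make the confusion concrete, here is a small sketch with a toy corpus I made up (the corpus and variable names are mine, not from the docs), showing where the two parameters seem to disagree:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "apple" occurs 4 times in total but in only 2 documents;
# "banana" occurs 3 times in total, spread over 3 documents.
docs = [
    "apple apple banana",
    "apple apple",
    "banana",
    "banana cherry",
]

# max_df=2 (an int, so an absolute *document* count): terms appearing in
# strictly more than 2 documents are dropped. "banana" (document
# frequency 3) is removed, even though "apple" has more total
# occurrences (4) and survives (document frequency 2).
vec_df = CountVectorizer(max_df=2)
vec_df.fit(docs)
print(sorted(vec_df.vocabulary_))  # ['apple', 'cherry']

# max_features=1: keep the single term with the highest *total* count
# across the corpus -- "apple" (4 occurrences), even though it appears
# in fewer documents than "banana".
vec_tf = CountVectorizer(max_features=1)
vec_tf.fit(docs)
print(sorted(vec_tf.vocabulary_))  # ['apple']
```

So, if I understand correctly, the two parameters count different things over the same corpus: one counts documents containing a token, the other counts total occurrences of it.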