
EDIT: this is the question I ultimately was trying to ask: Understanding min_df and max_df in scikit CountVectorizer

I was reading the documentation for scikit-learn's CountVectorizer and noticed that when discussing max_df, we are concerned with the document frequency of tokens:

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.  

But when we consider max_features, we are interested in term frequency:

max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

I am confused: if we use max_df and set it to 10, aren't we saying, "Ignore any token that shows up more than 10 times"?

And if we set max_features to 100, aren't we saying, "Only use the 100 tokens with the highest number of appearances across the corpus"?

If I got this right... then what is the actual difference between 'term frequency' and 'document frequency'?
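
To make the confusion concrete, here is a minimal sketch (the corpus is entirely made up) of the two counts I believe the docs are distinguishing:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "apple apple apple",   # 'apple' occurs 3 times in this one document
    "apple banana",
    "banana cherry",
]

vec = CountVectorizer()
X = vec.fit_transform(corpus)   # rows = documents, columns = tokens
i = vec.vocabulary_["apple"]

term_freq = X[:, i].sum()   # 4 -> total occurrences of 'apple' across the corpus
doc_freq = X[:, i].nnz      # 2 -> number of documents that contain 'apple'
```

So 'apple' has a term frequency of 4 but a document frequency of 2, and I can't tell which of these counts each parameter is thresholding on.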

Monica Heddneck
  • They are pretty much what it says on the tin - document frequency is a frequency of _documents_ (documents containing the term as fraction of all documents), term frequency is a frequency of _terms_. – pvg Jan 18 '16 at 08:12
  • https://en.wikipedia.org/wiki/Tf%E2%80%93idf – BrenBarn Jan 18 '16 at 08:18
  • I don't understand what "difference in wording" you're referring to. – BrenBarn Jan 18 '16 at 08:20
  • @pvg so if a term has a 'document frequency' of 0.5, that means it appeared in half the texts in the corpus? This must really mess with the values in the idf if we use max_df = 0.5 – Monica Heddneck Jan 18 '16 at 08:22
  • @MonicaHeddneck: if you use `max_df` indiscriminately then yes; that's why the docs describe `max_df` as removing "corpus-specific stop words". – Michael Foukarakis Jan 18 '16 at 08:27
  • Really! I'm applying stopwords on my own, without using scikit-learn's tools. I'm probably murdering my tokens! – Monica Heddneck Jan 18 '16 at 08:30
  • @MonicaHeddneck yep. – pvg Jan 18 '16 at 08:39
  • @pvg so in actuality, `max_df` only applies to the 318 stopwords supplied by `stop_words` in sklearn?? Hmm. Why do I care how many documents a stopword appeared in -- I thought something like 'a' or 'the', being stopwords by definition, should be completely removed regardless of their `max_df`! – Monica Heddneck Jan 18 '16 at 08:42

1 Answer


When you set max_df to 10, you are saying: "Ignore any token that shows up in more than 10 documents." Here you don't consider the number of times the token appears within each document, only the number of documents it appears in at all.
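
A minimal sketch of that behavior (the corpus and threshold are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

# max_df=2 (an int): drop any token appearing in MORE than 2 documents.
# 'the' occurs in all 3 documents, so it is removed from the vocabulary.
# 'sat' occurs in only 2 documents, so it survives.
vec = CountVectorizer(max_df=2)
vec.fit(corpus)
print(sorted(vec.vocabulary_))   # ['bird', 'cat', 'dog', 'flew', 'sat']
```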

When you set max_features to 100, it means: "Order the tokens in descending order by term frequency across the corpus (that is, the total number of times each token appears, summed over all documents), and then keep only the first 100 of those tokens."
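
And a matching sketch for max_features (again, a made-up corpus; the counts are chosen so there are no ties and the result is unambiguous):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple apple apple",
    "apple banana banana",
    "banana cherry",
]

# Total occurrences across the corpus: apple=4, banana=3, cherry=1.
# max_features=2 keeps the 2 tokens with the highest corpus-wide counts,
# regardless of how many documents they appear in.
vec = CountVectorizer(max_features=2)
vec.fit(corpus)
print(sorted(vec.vocabulary_))   # ['apple', 'banana']
```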

Kaustav Datta
  • This is not right -- the range of `max_df` is 0.0 to 1.0. – tripleee Jan 18 '16 at 08:24
  • @tripleee: max_df can either accept a float (proportion of documents) or an int (raw number of documents). – BrenBarn Jan 18 '16 at 08:25
  • it can also be `int` according to the description in the question ... the description states that if it is int, then you consider absolute counts ... I have considered the case of 10 since that was the example given by the OP – Kaustav Datta Jan 18 '16 at 08:26
  • But so then, what is the interpretation if it is a float in the indicated range? Sounds like it should be called `max_idf` then? – tripleee Jan 18 '16 at 08:26
  • @tripleee: Just read [the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). – BrenBarn Jan 18 '16 at 08:28
  • the description states that `If float, the parameter represents a proportion of documents`... so if the value is, say, 0.4, it ignores tokens that appear in more than 40% of the documents in the corpus – Kaustav Datta Jan 18 '16 at 08:29
  • I'm glad there is at least *some* value coming out of this terribly unpopular question. :( – Monica Heddneck Jan 18 '16 at 08:32