
I read the Spark documentation, which says:

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.

Could anyone explain it to me more clearly?

philantrovert
  • I think this says it all. https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer – Vpalakkat Aug 07 '18 at 11:56

1 Answer


minDF is used for removing terms that appear too infrequently.

For example: minDF = 0.01 means "ignore terms that appear in less than 1% of the documents". minDF = 5 means "ignore terms that appear in fewer than 5 documents".

The default minDF is 1, which means "ignore terms that appear in fewer than 1 document", i.e. in no documents at all. Thus, the default setting does not filter out any terms.
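To make the filtering rule concrete, here is a minimal pure-Python sketch of the document-frequency test (it does not use Spark itself; the toy corpus and variable names are illustrative only):

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a"],
]

min_df = 2  # keep only terms that appear in at least 2 documents

# Document frequency: count each term at most once per document,
# which is why each doc is converted to a set first.
df = Counter(term for doc in docs for term in set(doc))

vocab = sorted(term for term, count in df.items() if count >= min_df)
print(vocab)  # ['a', 'b'] — 'c' appears in only 1 document
```

With minDF = 2, the term "c" is dropped because its document frequency (1) is below the threshold, even though term frequency would count it once overall.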

vocabSize is the maximum number of terms the vocabulary can hold; the most frequent terms (by term frequency across the corpus) are kept. The default is 1 << 18, i.e. 2^18 = 262,144.
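The "top vocabSize words ordered by term frequency" selection can be sketched in plain Python like this (again not Spark itself; the corpus below is made up for illustration):

```python
from collections import Counter

docs = [
    ["spark", "spark", "hadoop"],
    ["spark", "hadoop", "flink"],
]

vocab_size = 2  # keep only the 2 most frequent terms overall

# Term frequency across the whole corpus: every occurrence counts,
# unlike document frequency, which counts each term once per document.
tf = Counter(term for doc in docs for term in doc)

vocab = [term for term, _ in tf.most_common(vocab_size)]
print(vocab)  # ['spark', 'hadoop'] — 'flink' is cut by vocab_size
```

Spark applies the minDF filter during fitting as well, so a term must both pass minDF and rank within the top vocabSize to end up in the vocabulary.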

Source references:

minDF: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L430-L435

vocabSize: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L444-L446

Peng Lee
Vpalakkat