I read the Spark documentation, which says:
"During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary."
Could anyone explain it to me more clearly?
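
To show where I am stuck, here is my current reading of that passage as a rough sketch in plain Python. This is not Spark's actual implementation: the function name `fit_vocabulary`, the toy corpus, and the alphabetical tie-break are all my own assumptions for illustration.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Hypothetical helper mimicking the quoted description, not Spark's code."""
    term_freq = Counter()  # total occurrences of each term across the corpus
    doc_freq = Counter()   # number of documents each term appears in
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))  # count each term at most once per document

    # minDF >= 1.0 is read as an absolute document count,
    # minDF < 1.0 as a fraction of the number of documents.
    threshold = min_df if min_df >= 1.0 else min_df * len(docs)
    kept = [t for t in term_freq if doc_freq[t] >= threshold]

    # Order by corpus-wide term frequency, highest first.
    # Breaking ties alphabetically is an assumption made for determinism.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]


docs = [
    ["a", "b", "c"],
    ["a", "b", "b", "c", "a"],
    ["a", "d", "d", "d", "d"],
]

vocab_all = fit_vocabulary(docs, vocab_size=3)              # no effective minDF filter
vocab_df2 = fit_vocabulary(docs, vocab_size=3, min_df=2.0)  # term must be in >= 2 documents
print(vocab_all)
print(vocab_df2)
```

In this toy corpus, "d" occurs four times but only in one document, so under my reading it makes the vocabulary when ranked by term frequency alone and is dropped once minDF is set to 2.0. Is that the intended behavior?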