The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults don't filter anything out at all.
That being said, I believe the currently accepted answer by @Ffisegydd isn't quite correct.
For example, run this using the defaults, to see that when min_df=1 and max_df=1.0:
1) all tokens that appear in at least one document are used (i.e., all tokens!)
2) all tokens that appear in all documents are used (we'll test with one candidate: 'everywhere').
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
# here is just a simple list of 3 documents.
corpus = ['one two three everywhere', 'four five six everywhere', 'seven eight nine everywhere']
# below we call fit_transform on the corpus and get the feature names.
X = cv.fit_transform(corpus)
vocab = cv.get_feature_names()  # in scikit-learn >= 1.0, use get_feature_names_out()
print(vocab)
print(X.toarray())
# stop_words_ holds tokens dropped by the min_df/max_df/max_features filters
print(cv.stop_words_)
We get:
['eight', 'everywhere', 'five', 'four', 'nine', 'one', 'seven', 'six', 'three', 'two']
[[0 1 0 0 0 1 0 0 1 1]
[0 1 1 1 0 0 0 1 0 0]
[1 1 0 0 1 0 1 0 0 0]]
set()
All tokens are kept, and cv.stop_words_ is empty: nothing was filtered out.
Further messing around with the arguments will clarify other configurations; here are two quick sketches.
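Continuing from the snippet above (same corpus and import), here's a minimal sketch of each filter actually removing something. The "expect" comments are what the document-frequency rules predict; re-run to confirm:

# min_df=2: keep only tokens that appear in at least 2 of the 3 documents.
cv_min = CountVectorizer(min_df=2)
cv_min.fit_transform(corpus)
print(cv_min.get_feature_names())  # expect ['everywhere'] -- the only token with df >= 2
print(cv_min.stop_words_)          # expect the other nine tokens, each with df == 1

# max_df=0.5: drop tokens that appear in more than 50% of the documents.
cv_max = CountVectorizer(max_df=0.5)
cv_max.fit_transform(corpus)
print(cv_max.get_feature_names())  # expect all nine tokens except 'everywhere'
print(cv_max.stop_words_)          # expect {'everywhere'}, dropped for appearing in 3/3 docs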
For fun and insight, I'd also recommend playing around with stop_words='english'
and seeing that, peculiarly, all of the words except 'seven' are removed, including 'everywhere'!
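A quick sketch to verify (again reusing the corpus above): 'seven' survives because it happens to be missing from scikit-learn's built-in English stop word list, which does include 'everywhere' and the other number words.

cv_stop = CountVectorizer(stop_words='english')
cv_stop.fit_transform(corpus)
print(cv_stop.get_feature_names())  # expect ['seven'] -- everything else is a built-in stop word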