I am trying to figure out whether I can use min_df, max_df and max_features at the same time as arguments of the TfidfVectorizer class from scikit-learn (sklearn). I understand what each of them is for.
I passed my data to TfidfVectorizer() with min_df = 0.05 and max_df = 0.95, meaning that terms appearing in fewer than 5% of my documents are ignored, as are terms appearing in more than 95% of my documents (as explained in Understanding min_df and max_df in scikit CountVectorizer).
With these settings, my data has 360 columns after TF-IDF. That is far too many, so I would like to set max_features = 100. However, when I print the shape of the transformed data, I still get 360 columns instead of the 100 I expected.
I also tried setting only max_features = 100, to check whether it would return just 100 columns without the other parameters, but it didn't: the result has 952 columns. The documentation says this parameter builds a vocabulary from only the top max_features terms, ordered by term frequency across the corpus, but that is not what I observe.
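For reference, here is a minimal sketch of the kind of setup I am describing. The corpus is a small made-up placeholder rather than my real data, the variable names are only illustrative, and max_features is scaled down to 3 so that the cap is visible on such a small vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus (hypothetical); my real documents are not shown here.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang on the log",
    "the dog barked at the bird",
]

# All three parameters passed together. max_features=3 stands in for the
# 100 used on the real data, because this toy vocabulary is tiny.
vectorizer = TfidfVectorizer(min_df=0.05, max_df=0.95, max_features=3)

# Fit and transform with the same vectorizer instance that received the
# parameters, then inspect the shape of the resulting sparse matrix.
X = vectorizer.fit_transform(docs)

print(X.shape)                      # (n_documents, n_columns)
print(len(vectorizer.vocabulary_))  # number of terms actually kept
```

The second entry of the shape is the number of columns. As far as I understand, it should never exceed max_features when the matrix comes from the same vectorizer instance that was given that argument.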
Does anyone have a clue what is going on?