
I am trying to figure out whether I can use min_df, max_df and max_features at the same time as arguments of the TfidfVectorizer class from scikit-learn. I understand what each of them does on its own.

I passed my data to TfidfVectorizer() with min_df = 0.05 and max_df = 0.95, meaning that terms appearing in fewer than 5% of my documents are ignored, and likewise those appearing in more than 95% of them (as explained in Understanding min_df and max_df in scikit CountVectorizer).

With these settings, my data has 360 columns after the TF-IDF transform. That is still too many, so I would also like to set max_features = 100. However, when I print the shape of the transformed data, I still get 360 columns instead of the 100 I expected.

I also tried setting only max_features = 100, to check whether it would return just 100 columns without the other parameters, but it didn't: the result has 952 columns. The documentation says this parameter is supposed to keep only the top max_features terms, but I can't observe that.

Does anyone have a clue of what is going on?

Marisa

1 Answer


I tried to replicate this with max_features=100, min_df=0.05, max_df=0.95, and the result was <11858x100 sparse matrix of type '<class 'numpy.float64'>'>, so the three parameters work together as intended. Check that you are actually fitting and transforming your data with the vectorizer you created with max_features.
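A common way to hit the symptom you describe is to create a new vectorizer with max_features but keep transforming with an old one that was fitted without it. A hedged sketch of that mistake (variable names are hypothetical, not from your code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["red green blue", "red green", "red blue yellow", "green blue red"]

old_vec = TfidfVectorizer()                # no max_features
old_vec.fit(corpus)

new_vec = TfidfVectorizer(max_features=2)  # intended limit

# Bug: transforming with old_vec ignores max_features entirely,
# so the full 4-term vocabulary is kept.
X_bug = old_vec.transform(corpus)

# Fix: fit and transform with the vectorizer that has max_features.
X_ok = new_vec.fit_transform(corpus)

print(X_bug.shape[1], X_ok.shape[1])  # 4 2
```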

If you could provide your code, it could be easier to identify the problem.