I can't seem to find an answer to my exact problem. Can anyone help?
A simplified description of my dataframe ("df"): It has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating if the resolution time was above average or not ("y").
I did bag-of-words on the text:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])
My matrix is 6290 x 4650. Getting the word names (i.e. feature names) is no problem:
feature_names = vectorizer.get_feature_names()
feature_names
Next, I want to know which of these 4650 words are most associated with above-average resolution times, and to reduce the matrix to just those columns in case I want to use it in a predictive model. I run a chi-square test to find the top 20 most important words.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
selector.fit(matrix, df["y"])
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]
Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:
chi_matrix.feature_names[selector.get_support(indices=True)].tolist()
Or
chi_matrix.feature_names[features.get_support()]
Both give me an error: feature_names not found. What am I missing?