
I can't seem to find an answer to my exact problem. Can anyone help?

A simplified description of my dataframe ("df"): it has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average or not ("y").
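For reference, a toy df with that shape might look like this (the rows here are invented purely for illustration):

import pandas as pd

df = pd.DataFrame({
    "Notes": ["customer reported outage, escalated to tier two",
              "password reset completed on first call"],
    "y": [1, 0],
})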

I did bag-of-words on the text:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])

My matrix is 6290 x 4650. No problem getting the word names (i.e., the feature names):

feature_names = vectorizer.get_feature_names()
feature_names

Next, I want to know which of these 4,650 words are most associated with above-average resolution times, and to reduce the matrix to the columns I may want to use in a predictive model. I run a chi-squared test to find the top 20 most important words.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
selector = SelectKBest(chi2, k=20)
y = df["y"]  # the binary above-average flag
selector.fit(matrix, y)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
chi_matrix = matrix[:,top_words[0]]

Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I was trying this:

chi_matrix.feature_names[selector.get_support(indices=True)].tolist()

Or

chi_matrix.feature_names[features.get_support()]

Both give me an error: feature_names not found. What am I missing?

– Amanda

2 Answers


After figuring out what I really wanted to do (thanks, Daniel) and doing more research, I found a couple of other ways to meet my objective.

Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])

from sklearn.feature_selection import chi2
chi2score = chi2(X, df['AboveAverage'])[0]

# Pair each term with its chi-squared score, sort ascending, keep the top 20
wscores = zip(vectorizer.get_feature_names(), chi2score)
wchi2 = sorted(wscores, key=lambda x: x[1])
topchi2 = zip(*wchi2[-20:])
show = list(topchi2)
show
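For clarity, show is a two-element list: a tuple of the 20 terms and a tuple of their matching chi2 scores, lowest to highest. A quick sketch of pairing them back up (the variable names here are just illustrative):

top_terms, top_scores = show
list(zip(top_terms, top_scores))  # (term, score) pairs, lowest score first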

Way 2 - This is the way I used, because it was the easiest for me to understand and it produces a nice output listing each word with its chi2 score and p-value. It's based on another thread on here: Sklearn Chi2 For Feature Selection

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X = vectorizer.fit_transform(df["Notes"])

y = df['AboveAverage']

# Select 10 features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(
    list(zip(vectorizer.get_feature_names(), chi2_selector.scores_, chi2_selector.pvalues_)),
    columns=['ftr', 'score', 'pval'])
chi2_scores
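To pull out just the strongest terms from that frame, a quick sort works (plain pandas, using only what's already defined above):

# Top 10 terms by chi-squared score, strongest first
chi2_scores.sort_values('score', ascending=False).head(10)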
– Amanda

I had a similar problem recently, but I was not restricted to the 20 most relevant words. Rather, I could select all the words whose chi-squared score was higher than a set threshold, so I'll give you the method I used for that second task. The reason this is preferable to just taking the first n words by chi-squared score is that those 20 words may have extremely low scores and thus contribute next to nothing to the classification task.

Here is how I have done it for a binary classification task:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

THRESHOLD_CHI = 5  # or whatever you like. You may try with
# for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
# and measure the f1 scores

X = df['text']
y = df['labels']

cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()  # chi2 also accepts the sparse matrix directly

chi2_stat, pval = chi2(cv_dense_matrix, y)

# Build a diagonal mask: ones on the diagonal for the terms whose chi-squared
# score clears the threshold, zeros everywhere else. A np.dot with this
# diagonal matrix zeroes out the low-scoring columns of a document-term matrix.
which_ones_to_keep = np.diag(chi2_stat > THRESHOLD_CHI)

The result is a square matrix with ones on the diagonal where the terms have a chi-squared score higher than the threshold, and zeros everywhere else. This matrix can then be combined via np.dot with either a cv matrix or a tf-idf matrix, which zeroes out the columns for the discarded terms, and the masked matrix can subsequently be passed to the fit method of a classifier.
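A minimal sketch of that last step, reusing the names from the snippet above; LogisticRegression is only an illustrative choice here, not part of the original approach:

from sklearn.linear_model import LogisticRegression

# Zero out the low-scoring columns, then fit on the masked counts
masked_matrix = np.asarray(np.dot(cv_dense_matrix, which_ones_to_keep))
clf = LogisticRegression()
clf.fit(masked_matrix, y)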

If you do this, the columns of the masked matrix still correspond one to one with the columns of the CountVectorizer output, so you can determine which terms were relevant for the given labels by matching the indices of the surviving (non-zero) columns against .get_feature_names(), or you can just forget about it and pass the masked matrix directly to a classifier.
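For the lookup itself, a quick sketch reusing the same names (the 1-D score vector is enough here; the square matrix is only needed for the np.dot step):

# Indices of the columns whose chi-squared score clears the threshold
kept_idx = np.where(chi2_stat > THRESHOLD_CHI)[0]
kept_terms = np.array(cv.get_feature_names())[kept_idx]
kept_terms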

– Daneel R.
  • Thanks for your help, Daniel. It makes sense to select terms based on chi scores, rather than just an arbitrary number of terms. The code you provide works perfectly, and I now have a 4650 x 4650 "which_ones_to_keep" matrix. I need to spend some time working through the steps to identify which terms are being kept. This kind of work is new to my company, so I feel I need to explain more of how the model is being created. – Amanda Oct 05 '18 at 15:46
  • You are welcome. The columns in the output matrix correspond one to one with the columns of the CountVectorizer, and thus the feature names of the latter are the unique identifiers of the former. Same indices and stuff. Also, the matrix is provided in matrix form, rather than just as a vector, because I needed it in matrix form for a `np.dot` with a tf-idf or cv matrix. You can just use the vector instead if you don't have a classifier to train, but just want to extract the best predictors for the given labels. – Daneel R. Oct 05 '18 at 15:52