0

About this:

NLP in Python: Obtain word names from SelectKBest after vectorizing

I found this code:

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    THRESHOLD_CHI = 5 # or whatever you like. You may try with
     # for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
     # and measure the f1 scores

    X = df['text']
    y = df['labels']

    cv = CountVectorizer()
    cv_sparse_matrix = cv.fit_transform(X)
    cv_dense_matrix = cv_sparse_matrix.todense()

    chi2_stat, pval = chi2(cv_dense_matrix, y)

    chi2_reshaped = chi2_stat.reshape(1,-1)
    which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
    which_ones_to_keep = np.repeat(which_ones_to_keep ,axis=0,repeats=which_ones_to_keep.shape[1])

This code computes the chi squared test and should keep the best features within a chosen threshold. My question is how to choose a theshold for the chi squared test scores?

juuso
  • 612
  • 7
  • 26

1 Answers1

2

Chi square does not have a specific range of outcome, so it's hard to determine a threshold beforehand. Usually what you can do is to sort the variables depending on their p values, the logic is that lower p values are better, because they imply a higher correlation between features and the target variable (we want to discard features that are independent, i.e. not predictors of the target variable). In this case you have anyway to decide how many features to keep, and that is a hyper parameter that you can tune manually or even better by using a grid search.

Be aware that you can avoid to perform the selection manually, sklearn implement already a function SelectKBest to select the best k features based on chi square, you can use it as follow:

from sklearn.feature_selection import SelectKBest, chi2

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)

But if for any reason you want to rely solely on the raw chi2 value, you could calculate the minimum and maximum values among the variables, and then divide the interval in n steps to test trough a grid search.

Edoardo Guerriero
  • 1,210
  • 7
  • 16