About this:
NLP in Python: Obtain word names from SelectKBest after vectorizing
I found this code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
THRESHOLD_CHI = 5 # or whatever you like. You may try with
# for threshold_chi in [1,2,3,4,5,6,7,8,9,10] if you prefer
# and measure the f1 scores
X = df['text']
y = df['labels']
cv = CountVectorizer()
cv_sparse_matrix = cv.fit_transform(X)
cv_dense_matrix = cv_sparse_matrix.todense()
chi2_stat, pval = chi2(cv_dense_matrix, y)
chi2_reshaped = chi2_stat.reshape(1,-1)
which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
which_ones_to_keep = np.repeat(which_ones_to_keep ,axis=0,repeats=which_ones_to_keep.shape[1])
This code computes the chi squared test and should keep the best features within a chosen threshold. My question is how to choose a theshold for the chi squared test scores?