I am currently working on feature selection. As part of this, I need to apply the chi-squared test to the features in a pandas DataFrame and determine the top 'n' features.
From articles available on the internet, I understand that the value of 'n' is set through the 'k' parameter of the SelectKBest class, which can be imported from sklearn.feature_selection.
But how do I find out the names (or column indices) of the top 'n' features selected by the chi-squared test?
To illustrate, here is an example (thanks to Chris Albon for an easy example on his site) taken from this link: https://chrisalbon.com/machine-learning/chi-squared_for_feature_selection.html
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
type(X_kbest)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
As can be seen from the code, the input data is passed as a NumPy array. Assume the four columns are named Col_A, Col_B, Col_C and Col_D, and that the test has chosen the 3rd and 4th columns as the two best features. This can be seen by printing the value of X_kbest:
print(X_kbest)
[[1 0]
[1 0]
[1 0]
...,
[5 2]
[5 2]
[5 1]]
But I need my output to be a list containing only the selected feature names (in this case, Col_C and Col_D), or the feature names along with the data.
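To make the desired output concrete, here is a minimal sketch of one way it might be produced. It assumes the data is wrapped in a pandas DataFrame with the hypothetical column names Col_A to Col_D from above, and it relies on the fitted selector's get_support() method, which returns a boolean mask over the input columns. The variable names df, mask, selected_names and df_kbest are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

# Hypothetical column names, as assumed in the question
columns = ["Col_A", "Col_B", "Col_C", "Col_D"]

# Same integer conversion as in the example above, but kept in a DataFrame
df = pd.DataFrame(iris.data.astype(int), columns=columns)
y = iris.target

# Fit the selector; k=2 keeps the two highest-scoring features
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(df, y)

# Boolean mask with one entry per input column (True = selected)
mask = chi2_selector.get_support()

# List of the selected feature names
selected_names = df.columns[mask].tolist()

# Selected features along with their data, column names preserved
df_kbest = df.loc[:, mask]

print(selected_names)
print(df_kbest.head())
```

If column positions are wanted instead of names, get_support(indices=True) should return the integer indices of the selected columns.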