I have a classification problem in Python and I want to find out which features are most important for the classification. My data is mixed: some columns are categorical and some are not.
I'm applying transformations with OneHotEncoder and Normalizer:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, Normalizer

columns_for_vectorization = ['A', 'B', 'C', 'D', 'E']
columns_for_normalization = ['F', 'G', 'H']
transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown='ignore'), columns_for_vectorization),
                  ('Normalizer', Normalizer(), columns_for_normalization)],
    remainder='passthrough')  # the default is to drop untransformed columns
After that, I'm splitting my data and applying the transformation:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.25, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
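For reference, I can list the post-transformation column names from the fitted transformer (assuming a recent scikit-learn, 1.1 or later, where ColumnTransformer and all its inner transformers implement get_feature_names_out):

# Names of the expanded columns: each categorical column becomes one
# column per category (e.g. 'Vector Cat__A_<category>'), while the
# normalized columns stay one-to-one (e.g. 'Normalizer__F')
expanded_names = transformerVectoriser.get_feature_names_out()
print(len(expanded_names))  # more than the original 8
print(expanded_names)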
Then I'm training my model:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=5, n_estimators=50, random_state=0)
model = clf.fit(x_train, y_train)
And I'm printing the feature importances:
print(model.feature_importances_)
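To make the raw array readable, I can at least pair each importance with its expanded column name (a minimal sketch, assuming the get_feature_names_out call above works and that pandas is available):

import pandas as pd

# The order of feature_importances_ matches the order of the columns
# produced by the fitted ColumnTransformer
importances = pd.Series(model.feature_importances_,
                        index=transformerVectoriser.get_feature_names_out())
print(importances.sort_values(ascending=False).head(10))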
I'm getting results like this:
[1.40910562e-03 1.46133832e-03 4.05058130e-03 3.92205197e-03
2.13243521e-03 5.78555893e-03 1.51927254e-03 1.14987114e-03
...
6.37840204e-04 7.21061812e-04 5.77726129e-04 5.32382587e-04]
The problem is that at the beginning I had 8 features, but because of the transformation of the categorical data I now have more than 20. How can I handle this? How can I tell which of the original features is the most important?
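For context, this is the kind of grouping I have in mind, summing the importances of all one-hot columns that came from the same original column (a sketch, only tested on my own data; it assumes the 'transformer__column_category' naming scheme of get_feature_names_out):

import pandas as pd

names = transformerVectoriser.get_feature_names_out()
imp = pd.Series(model.feature_importances_, index=names)

def original_column(name):
    # Names look like 'Vector Cat__A_<category>' or 'Normalizer__F':
    # drop the transformer prefix, then strip the category suffix for
    # one-hot encoded columns
    col = name.split('__', 1)[1]
    for c in columns_for_vectorization:
        if col == c or col.startswith(c + '_'):
            return c
    return col  # normalized or passthrough columns are already one-to-one

per_original = imp.groupby(original_column).sum()
print(per_original.sort_values(ascending=False))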