I have a regular tabular dataset: 100 features are loaded from the database.
I want to push it through a regular sklearn Pipeline with preprocessing, encoding, some custom transformers, etc.
The penultimate estimator would be SelectKBest(k=10).
So the model itself only needs 10 features, but the pipeline still requires all 100 as input.
In production I would like to feed the model only the "necessary" features, avoiding the extra ones to reduce computation time.
Of course I could rebuild the pipeline, but the whole point of sklearn is to avoid that. I don't know how "standard" a practice this is.
I understand why it simply doesn't work: the preprocessing steps can expand the features, so SelectKBest might actually receive, say, 150 columns as input, and then it is not obvious how to map the selected columns back to the original "necessary" input features.
Perhaps there are other tools that handle this kind of thing out of the box?
Basic example:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
data = load_diabetes(as_frame=True)
X, y = data.data, data.target
X = X.iloc[:, :10]  # no-op here: load_diabetes already has exactly 10 features
pipeline = Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(score_func=f_regression, k=4)),
('model', LinearRegression())
])
pipeline.fit(X, y)
selected_features = pipeline.named_steps['feature_selection'].get_support()
selected_features = X.columns[selected_features]
print(f"Selected features: {selected_features}")
# Selected features: Index(['bmi', 'bp', 's4', 's5'], dtype='object')
prod_data = X[selected_features]
pred = pipeline.predict(prod_data)
# Here will be an Exception
# ValueError: The feature names should match those that were passed during fit.
# Feature names seen at fit time, yet now missing:
# - age
# - s1
# - s2
# - s3
# - s6
# - ...
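For completeness, the "rebuild" option mentioned above can be sketched like this. It refits from scratch, which is exactly what I'd like to avoid, and the slim model only matches the full one here because StandardScaler works per column; with cross-column transformers this equivalence would not hold:

```python
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

# Fit the full pipeline once to discover the selected features
full = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(score_func=f_regression, k=4)),
    ('model', LinearRegression()),
])
full.fit(X, y)
mask = full.named_steps['feature_selection'].get_support()
selected = X.columns[mask]

# Rebuild a slim pipeline that only ever sees the selected columns;
# refitting is required, there is no built-in way to "slice" the fitted one
slim = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
slim.fit(X[selected], y)

pred = slim.predict(X[selected])  # no ValueError: only 4 columns needed
```

This works for the toy case, but it doubles the training code and silently breaks if any transformer mixes information across columns, which is why I'm asking whether something does this out of the box.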