
I have a regular tabular dataset; 100 features come in from the database.

I want to push it into a regular sklearn.pipeline in which there will be preprocessing, encoding, some custom transformers, etc.

The penultimate estimator would be SelectKBest(k=10).

So the model itself really only needs 10 features, yet the pipeline will still require all 100 features as input.

In production I would like to use only the "necessary" features for the model; I want to avoid the extra features to reduce calculation time.

Of course I could rebuild the pipeline, but the whole point of sklearn is not having to do that. I don't know how "standard" a practice this is.

I understand why this simply doesn't work: for example, 150 features may actually reach the SelectKBest input (the preprocessing steps can generate new ones), and in that case it is not obvious how to map the selection back to the "necessary" input features.

Perhaps there are other tools that handle this out of the box?

Basic example:

from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

X = X.iloc[:, :10]

pipeline = Pipeline([
    ('scaler', StandardScaler()), 
    ('feature_selection', SelectKBest(score_func=f_regression, k=4)),
    ('model', LinearRegression())
])

pipeline.fit(X, y)

selected_features = pipeline.named_steps['feature_selection'].get_support()
selected_features = X.columns[selected_features]
print(f"Selected features: {selected_features}")
# Selected features: Index(['bmi', 'bp', 's4', 's5'], dtype='object')

prod_data = X[selected_features]

pred = pipeline.predict(prod_data)

# The predict call raises an exception:
# ValueError: The feature names should match those that were passed during fit.
# Feature names seen at fit time, yet now missing:
# - age
# - s1
# - s2
# - s3
# - s6
# - ...
Nikitosiwe
  • My understanding is that you have 100 features incoming from a database, and you are using a pipeline that processes the 100 features down to 10 features. You then give these 10 features to a model. Are `SelectKBest` and your final model both part of the pipeline, or are they separate from the pipeline? It sounds like you need the pipeline to reduce 100 features down to 10 - why can't you keep this pipeline? – some3128 Aug 19 '23 at 10:53
  • Please provide enough code so others can better understand or reproduce the problem. – Community Aug 19 '23 at 15:16
  • @some3128 I've added some code, for example. – Nikitosiwe Aug 19 '23 at 19:40
  • Certainly, I can set up an initial pipeline solely for feature selection (without any model). Subsequently, I can devise a second pipeline which is essentially similar but integrated with a model, and I'll train it using only the chosen features. However, I'm uncertain if this is the ideal approach. @some3128 – Nikitosiwe Aug 19 '23 at 19:50
  • Thanks for the clarifications. Please see the answer I've posted and let me know what you think. – some3128 Aug 20 '23 at 10:23

3 Answers


I've generally suggested the solution you're trying to avoid: rebuild a pipeline without the selection steps and without the removed columns in the training set.
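A minimal sketch of that approach, reusing the fitted pipeline and X from the question (slim_pipeline is just an illustrative name). Because StandardScaler is fitted per column and SelectKBest only drops columns, refitting on the selected columns reproduces the same scaling and model here; with transformers whose fit depends on all columns at once that equivalence would not hold:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Columns kept by the fitted selection step
selected = X.columns[pipeline.named_steps['feature_selection'].get_support()]

# Rebuild the pipeline without the selection step and refit it on the
# selected columns only; this is the object to ship to production.
slim_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
slim_pipeline.fit(X[selected], y)

# Production rows now only need the selected columns
pred = slim_pipeline.predict(X[selected])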

It may be possible to identify and change the fitted attributes of each pipeline step (remove the corresponding entries from a scaler's mean_ and scale_, shrink n_features_in_ and feature_names_in_, ...), and with care that could be automated. But messing with internals is a bit risky: removing the wrong thing could produce no errors yet silently apply the wrong scaling to a column.
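For completeness, a rough (and deliberately fragile) sketch of that surgery for the question's scaler -> SelectKBest -> model pipeline, assuming the default numpy transform output; these are public fitted attributes in current sklearn versions, but none of this is a supported workflow:

# Mask of the columns kept by the fitted selector (computed before any editing)
mask = pipeline.named_steps['feature_selection'].get_support()

# Slim the scaler down to the selected columns
scaler = pipeline.named_steps['scaler']
scaler.mean_ = scaler.mean_[mask]
scaler.scale_ = scaler.scale_[mask]
scaler.var_ = scaler.var_[mask]
scaler.n_features_in_ = int(mask.sum())
scaler.feature_names_in_ = scaler.feature_names_in_[mask]

# Slim the selector; with k=4 and only 4 remaining scores it now keeps everything
selector = pipeline.named_steps['feature_selection']
selector.scores_ = selector.scores_[mask]
selector.pvalues_ = selector.pvalues_[mask]
selector.n_features_in_ = int(mask.sum())

# The edited pipeline now accepts only the selected columns
pred = pipeline.predict(X[X.columns[mask]])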

Another low-tech solution: you don't care what the values of the removed columns are for a prediction row, so just make them up. Your sklearn pipeline will still process the fake values, but you don't need to gather those fields in your production environment.
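For example, with the objects from the question (pad_missing_columns is a made-up helper name); the dummy values only flow through columns that SelectKBest drops, so they never reach the model:

# Production frame that contains only the selected columns
prod_data = X[selected_features]

# Hypothetical helper: add the missing columns back with a dummy value and
# restore the column order the pipeline saw at fit time.
def pad_missing_columns(frame, fit_columns, fill_value=0.0):
    return frame.reindex(columns=fit_columns, fill_value=fill_value)

padded = pad_missing_columns(prod_data, X.columns)
pred = pipeline.predict(padded)  # dummy columns are scaled, then dropped by SelectKBest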

Ben Reiniger

In your pipeline, the SelectKBest step will be responsible for keeping 4 features, and dropping the others. The model that comes after this step will only see those 4 features. This is because internally the pipeline will use SelectKBest to transform the data after fitting it, so that step only passes on the 4 selected features.

To get the prediction using the selected features, run pipeline.fit(X, y).predict(X). That'll automatically drop the unnecessary features before they get to the final model. I've modified your code:

from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

X = X.iloc[:, :10]

#Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()), 
    ('feature_selection', SelectKBest(score_func=f_regression, k=4)),
    ('model', LinearRegression())
])

#Fit and predict. Final model only uses the k best features.
pred_using_selected_features = pipeline.fit(X, y).predict(X)

#Print results
selected_features = pipeline.named_steps['feature_selection'].get_support()
print(f"Selected features: {X.columns[selected_features].to_list()}")

print(f'Model saw {pipeline.named_steps["model"].n_features_in_} features')

Output:

Selected features: ['bmi', 'bp', 's4', 's5']
Model saw 4 features

As the data moved through the pipeline, 4 features were selected, and the model only saw those 4 features.

some3128
  • Sure, it works. But you pass the full `X` dataset (with 10 columns) to the predict function. My question is how to reduce that number of columns, because the model in fact needs only 4 features. – Nikitosiwe Aug 20 '23 at 10:51
  • I passed in the full `X` because the pipeline will filter out the 4 features for the model automatically. But you would prefer to only pass in the 4 features from the start? In that case I am not sure if there's a single-pipeline solution. If the pipeline was originally *trained* on 10 features, then it'll always ask for 10 features at the input, and if you supply fewer features than it was trained on I think it'll error. – some3128 Aug 20 '23 at 10:56

I agree with @some3128, it looks like there's no single Pipeline solution.

As I understand it, estimators like SelectKBest are meant to reduce features that were generated internally in the Pipeline by other estimators such as PolynomialFeatures, PCA, etc.

Sure, maybe there is a way to determine the best generated features and prevent the generation of useless features at prediction time, but that is more of a "low-level" optimization.
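For example, a sketch of mapping the selection back to the generated feature names, assuming a reasonably recent sklearn where Pipeline and every transformer before the selector implement get_feature_names_out() (built-in ones like PolynomialFeatures and PCA do):

# Names of the features that actually enter SelectKBest
generated_names = pipeline[:-2].get_feature_names_out()

# Subset of them that the selector keeps
support = pipeline.named_steps['feature_selection'].get_support()
print(generated_names[support])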

Reducing the features before the pipeline is part of the feature engineering process, and that's a separate question.

Nikitosiwe