I want to apply feature selection using a binary mask. The mask is created from a template file and a threshold, so that the ones and zeros in the mask correspond to values in the template that lie above or below that threshold. In the next step I want to use this mask to 'cut out' features from the data set and pass the resulting feature subset on to the following pipeline steps. Both the mask-building procedure and the preprocessing procedure take keyword arguments (e.g. the threshold value just mentioned), which can be treated as hyperparameters and thus optimized via nested cross-validation. Is it possible to optimize both the mask-building procedure and the subsequent pipeline steps within a single pipeline, and if so, how?
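To make the intended operation concrete, here is a minimal NumPy sketch of the two steps. It assumes the template and the subject data have already been flattened so that each template value lines up with one feature column. The function names build_mask and cut_features and the toy values are hypothetical and not part of nilearn or scikit-learn.

```python
import numpy as np


def build_mask(template, threshold):
    """Return a boolean mask that is True where the template exceeds the threshold."""
    return np.asarray(template) > threshold


def cut_features(X, mask):
    """Keep only the feature columns of X that the mask selects."""
    return np.asarray(X)[:, np.asarray(mask).ravel()]


# Hypothetical toy data: a flattened template with 5 "voxels" and 3 subjects
template = np.array([0.5, 1.5, 2.5, 3.5, 0.1])
X = np.arange(15).reshape(3, 5)

mask = build_mask(template, threshold=1.0)
X_subset = cut_features(X, mask)

print(mask.astype(int))  # binary mask as ones and zeros
print(X_subset.shape)    # (n_subjects, n_selected_features)
```

The threshold is the only free parameter of this step, which is why it is a natural candidate for hyperparameter search.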
Here's an example using nilearn's oasis dataset:
Let's say I have a NIfTI file called template, which serves as the template for the binary mask. I also have grey matter MRI images from 30 subjects (features) and their age (labels):
import numpy as np
from nilearn import datasets
from sklearn.svm import SVC
from nilearn.input_data import NiftiMasker
from sklearn.preprocessing import Binarizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from nilearn import image
n_subjects = 30
############################################################################
# load the template file that serves as the basis for the mask
template = image.load_img("./template.nii.gz").get_fdata()
# load oasis dataset
oasis_dataset = datasets.fetch_oasis_vbm(n_subjects=n_subjects)
# load features
X = oasis_dataset.gray_matter_maps
# load labels
age = oasis_dataset.ext_vars['age'].astype(float)
The pipeline is supposed to take both the template and the features as input. Nested cross-validation is then applied to find the optimal hyperparameters. The pipeline has to contain a cutting step, mask_cutter, which takes both the mask and the features as inputs and returns a feature subset of the original data set. In this example, both the threshold for set_mask and the C parameter for svc should be optimized (note that the following section is non-working pseudocode):
# Set up possible values of parameters to optimize over
p_grid = {
    "mask__threshold": np.array([1, 2, 3]),
    "svc__C": np.array([4, 5, 6]),
}
# Binarizer to create binary mask using template
set_mask = Binarizer()
# NiftiMasker to cut out features from X using binary mask
mask_cutter = NiftiMasker()
# Use Support Vector Classification Algorithm
svc = SVC(kernel='linear')
# create pipeline
mask_svc = Pipeline([
    ('mask', set_mask),
    ('cut', mask_cutter),
    ('svc', svc),
])
###########################################################################
grid = GridSearchCV(mask_svc, param_grid=p_grid, cv=3)
nested_cv_scores = cross_val_score(grid, X, age, cv=3)
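One possible direction, shown only as a hedged sketch: the mask building and the cutting can be folded into a single custom scikit-learn transformer that receives the template as a constructor argument, so that its threshold becomes tunable as mask__threshold. The class name TemplateMasker is hypothetical. The sketch assumes the template and the images are already flattened into arrays (it does not read NIfTI files the way NiftiMasker does), and it uses synthetic data with toy binary labels, because SVC is a classifier and a continuous target such as age would call for a regressor like SVR instead.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class TemplateMasker(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keep features whose template value exceeds threshold."""

    def __init__(self, template=None, threshold=0.0):
        self.template = template
        self.threshold = threshold

    def fit(self, X, y=None):
        # The mask depends only on the template and the threshold, not on X
        self.mask_ = np.asarray(self.template).ravel() > self.threshold
        return self

    def transform(self, X):
        # Select the feature columns marked True in the mask
        return np.asarray(X)[:, self.mask_]


# Synthetic stand-ins for the flattened template, the images and the labels
rng = np.random.default_rng(0)
n_subjects, n_features = 30, 20
template = np.linspace(0, 4, n_features)      # one template value per feature
X = rng.normal(size=(n_subjects, n_features))
y = np.tile([0, 1], n_subjects // 2)          # toy binary labels

pipe = Pipeline([
    ("mask", TemplateMasker(template=template)),
    ("svc", SVC(kernel="linear")),
])

# Both the mask threshold and the SVC regularization are searched together
p_grid = {"mask__threshold": [1, 2, 3], "svc__C": [4, 5, 6]}

grid = GridSearchCV(pipe, param_grid=p_grid, cv=3)       # inner loop
nested_cv_scores = cross_val_score(grid, X, y, cv=3)     # outer loop
print(nested_cv_scores)
```

Passing the template through the constructor sidesteps the fact that pipeline steps only receive X and y. Whether the same pattern carries over cleanly to NIfTI inputs via NiftiMasker is not verified here.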