I am working on a project that calls for a lean Python AutoML pipeline implementation. As per the project definition, data entering the pipeline comes in the form of serialised business objects, e.g. (artificial example):
property.json:
{
"area": "124",
"swimming_pool": "False",
"rooms" : [
... some information on individual rooms ...
]
}
Machine learning targets (e.g. predicting whether a property has a swimming pool based on its other attributes) are stored within the business object rather than delivered in a separate label vector, and business objects may contain observations which should not be used for training.
What I am looking for
I need a pipeline engine which supports initial (or later) pipeline steps that i) dynamically change the targets of the machine learning problem (e.g. extract them from the input data, threshold real values) and ii) resample the input data (e.g. upsampling or downsampling of classes, filtering of observations).
The pipeline ideally should look as follows (pseudocode):
swimming_pool_pipeline = Pipeline([
("label_extractor", SwimmingPoolExtractor()), # skipped in prediction mode
("sampler", DataSampler()), # skipped in prediction mode
("featurizer", SomeFeaturization()),
("my_model", FitSomeModel())
])
swimming_pool_pipeline.fit(training_data) # not passing in any labels
preds = swimming_pool_pipeline.predict(test_data)
The pipeline execution engine needs to fulfill/allow for the following:
- During model training (.fit()), SwimmingPoolExtractor extracts target labels from the input training data and passes the labels on (alongside the independent variables);
- In training mode, DataSampler() uses the target labels extracted in the previous step to sample observations (e.g. it could do minority upsampling or filter observations);
- In prediction mode, the SwimmingPoolExtractor() does nothing and just passes on the input data;
- In prediction mode, the DataSampler() does nothing and just passes on the input data (a minimal sketch of these semantics follows this list).
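To make the intended execution semantics concrete, here is a minimal plain-Python sketch of the contract I have in mind. All class and method names (fit_transform, transform, etc.) are hypothetical and not taken from any existing library; every step before the model may rewrite both data and labels during fit, while predict only threads the data through.

# Hypothetical engine: during fit, every step before the model may rewrite both
# the data and the labels; during predict, only the data is passed along.
class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, step) tuples; the last step is the model

    def fit(self, data):
        labels = None
        for name, step in self.steps[:-1]:
            # Each intermediate step may change data and/or labels in training mode.
            data, labels = step.fit_transform(data, labels)
        self.steps[-1][1].fit(data, labels)
        return self

    def predict(self, data):
        for name, step in self.steps[:-1]:
            # In prediction mode, steps such as the extractor/sampler just pass data through.
            data = step.transform(data)
        return self.steps[-1][1].predict(data)

A step such as SwimmingPoolExtractor would then implement fit_transform(data, labels) to emit the extracted labels and transform(data) as a plain pass-through.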
Example
For example, assume that the data looks as follows:
property.json:
"properties" = [
{ "id_": "1",
"swimming_pool": "False",
...,
},
{ "id_": "2",
"swimming_pool": "True",
...,
},
{ "id_": "3",
# swimming_pool key missing
...,
}
]
The application of SwimmingPoolExtractor() would extract something like:
"labels": [
{"id_": "1", "label": "0"},
{"id_": "2", "label": "1"},
{"id_": "3", "label": "-1"}
]
from the input data and set these as the machine learning pipeline's "targets".
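A rough sketch of that extraction logic (plain Python; the helper name is made up for illustration) could be:

def extract_swimming_pool_labels(properties):
    # Map the string-encoded swimming_pool attribute to a label;
    # observations missing the key get the sentinel label "-1".
    labels = []
    for obs in properties:
        value = obs.get("swimming_pool")
        if value == "True":
            label = "1"
        elif value == "False":
            label = "0"
        else:
            label = "-1"
        labels.append({"id_": obs["id_"], "label": label})
    return labels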
The application of DataSampler() could, for example, further include logic that removes from the training data any instance which does not contain a swimming_pool key (label = -1).
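As a sketch (again plain Python with made-up names), that filter could look like:

def drop_unlabelled(properties, labels):
    # Keep only observations for which a target could be extracted (label != "-1").
    keep_ids = {lab["id_"] for lab in labels if lab["label"] != "-1"}
    kept_properties = [obs for obs in properties if obs["id_"] in keep_ids]
    kept_labels = [lab for lab in labels if lab["id_"] in keep_ids]
    return kept_properties, kept_labels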
Subsequent steps should use the modified training data (filtered, not including the observation with id_=3) to fit the model. As stated above, in prediction mode, the DataSampler and SwimmingPoolExtractor would just pass the input data through.
How To
To my knowledge, neither neuraxle nor sklearn (for the latter I am sure) offers pipeline steps that meet the required functionality (from what I have gathered so far, neuraxle must at least have support for slicing data, given that it implements cross-validation meta-estimators).
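For sklearn specifically, my understanding is that a transformer inside a Pipeline can only return a transformed X, and the y passed to Pipeline.fit() is forwarded unchanged to every step, so a step has no supported way to emit new targets or drop observations for downstream steps; roughly:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class PassThrough(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # y is visible here during fit, but ...
        return self

    def transform(self, X):
        # ... transform() may only return X; there is no supported way to hand
        # modified labels or a reduced set of rows to the next step.
        return X

pipe = Pipeline([("extractor", PassThrough()), ("model", LogisticRegression())])
# pipe.fit(X, y) forwards the same y to every step; pipe.fit(X) without labels
# would fail for a supervised final estimator.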
Am I missing something, or is there a way to implement such functionality in either of these pipeline frameworks? If not, are there alternatives to the listed libraries within the Python ecosystem that are reasonably mature and support such use cases (leaving aside the issues that might arise from designing pipelines in this manner)?