When defining custom transformers or estimators for PySpark ML, the "pattern" is to have "Params MixIns" and a Transformer class. Examples of those Params MixIns are `HasInputCol`, `HasOutputCol`, etc.

Source code for the `HasInputCol` MixIn in PySpark 3.3:
```python
# Source: https://github.com/apache/spark/blob/branch-3.3/python/pyspark/ml/param/shared.py#L184-L203
class HasInputCol(Params):
    inputCol: "Param[str]" = Param(
        Params._dummy(),
        "inputCol",
        "input column name.",
        typeConverter=TypeConverters.toString,
    )

    def __init__(self) -> None:
        super(HasInputCol, self).__init__()

    def getInputCol(self) -> str:
        return self.getOrDefault(self.inputCol)
```
However, even though those mixins provide getters, the user is expected to define the setters themselves, as seen in the following:
```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable


class ColumnDuplicatorTransformer(
    Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable,
):
    # Based on: https://stackoverflow.com/a/32337101/7690767
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    # Required in Spark >= 3.0
    def setInputCol(self, value):
        return self._set(inputCol=value)

    # Required in Spark >= 3.0
    def setOutputCol(self, value):
        return self._set(outputCol=value)

    def _transform(self, dataset):
        # Copy the input column into a new output column
        return dataset.withColumn(self.getOutputCol(), dataset[self.getInputCol()])
```
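For context, this is roughly how I am using it (a minimal sketch; the DataFrame and the column names `a`/`b` are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["a"])

# Duplicate column "a" into a new column "b"
duplicator = ColumnDuplicatorTransformer(inputCol="a", outputCol="b")
duplicator.transform(df).show()
```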
My questions are:

- Why are the two setters, `setInputCol` and `setOutputCol`, expected to be defined in the child class instead of the MixIn? If it is to allow read-only params, why not define them in an ancestor class that raises `NotImplementedError` or something like that?
- Why is there a need for `setParams`? Why not call `_set` directly in the `__init__`? Is this to facilitate inheritance down the hierarchy?
- Is there any specific reason for using `@keyword_only` rather than passing all the parameters explicitly? It seems too implicit, especially since it adds `_input_kwargs` at runtime, which does not seem like good practice (see the sketch after this list for the explicit alternative I have in mind).
- Is there any documentation about the right way to build a custom Transformer? Everywhere I look is either Stack Overflow or personal blogs; there doesn't seem to be an official guideline.
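To make the `@keyword_only` question concrete, this is the kind of explicit alternative I have in mind (my own sketch, not taken from Spark):

```python
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol


class ExplicitColumnDuplicator(Transformer, HasInputCol, HasOutputCol):
    # Hypothetical: pass parameters explicitly instead of relying on
    # @keyword_only and the _input_kwargs attribute it injects at runtime.
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        if inputCol is not None:
            self._set(inputCol=inputCol)
        if outputCol is not None:
            self._set(outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(), dataset[self.getInputCol()])
```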
My first impression is that this is an effect of porting a Java framework like Spark to Python while trying to keep the API as close to the original as possible, hence, for example, the unconventional camelCase. But I am interested, from a design perspective, in whether this was preferred over a more Pythonic approach (using Protocols, Properties and Descriptors, for example).
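For instance, I would have expected something closer to this property-based sketch (again my own illustration, not an API PySpark provides):

```python
from pyspark.ml.param.shared import HasInputCol


class PythonicParams(HasInputCol):
    # Illustrative only: expose the existing inputCol Param through a Python
    # property instead of separate getInputCol/setInputCol methods.
    @property
    def input_col(self) -> str:
        return self.getOrDefault(self.inputCol)

    @input_col.setter
    def input_col(self, value: str) -> None:
        self._set(inputCol=value)
```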