
When defining custom transformers or estimators for PySpark ML, the usual "pattern" is to combine "Params mixins" with a Transformer class.

Examples of those Params mixins are HasInputCol, HasOutputCol, etc.

Source code for the HasInputCol mixin in PySpark 3.3:

# Source: https://github.com/apache/spark/blob/branch-3.3/python/pyspark/ml/param/shared.py#L184-L203
class HasInputCol(Params):
    inputCol: "Param[str]" = Param(
        Params._dummy(),
        "inputCol",
        "input column name.",
        typeConverter=TypeConverters.toString,
    )

    def __init__(self) -> None:
        super(HasInputCol, self).__init__()

    def getInputCol(self) -> str:
        return self.getOrDefault(self.inputCol)

However, even though those mixins provide getters, users are expected to define the setters themselves, as in the following:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class ColumnDuplicatorTransformer(
    Transformer, HasInputCol, HasOutputCol, DefaultParamsReadable, DefaultParamsWritable,
):
    # Based on: https://stackoverflow.com/a/32337101/7690767
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    # Required in Spark >= 3.0
    def setInputCol(self, value):
        return self._set(inputCol=value)

    # Required in Spark >= 3.0
    def setOutputCol(self, value):
        return self._set(outputCol=value)

    def _transform(self, dataset):
        # Copy the input column into the output column; withColumn expects a
        # Column expression, not just the column name string.
        return dataset.withColumn(self.getOutputCol(), dataset[self.getInputCol()])
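For completeness, here is a minimal usage sketch of the transformer above (assuming a local SparkSession and a toy DataFrame; the column names are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["letters"])

# Duplicate the "letters" column into "letters_copy".
duplicator = ColumnDuplicatorTransformer(inputCol="letters", outputCol="letters_copy")
duplicator.transform(df).show()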

My questions are:

  1. Why are the two setters setInputCol and setOutputCol expected to be defined in the child class instead of the mixin? If it is to allow read-only params, why not define them in an ancestor class and let subclasses override them with a NotImplementedError or something like that (see the sketch after this list)?
  2. Why is there a need for setParams? Why not call _set directly in __init__? Is this to facilitate inheritance down the hierarchy?
  3. Is there any specific reason for using @keyword_only rather than passing all the parameters explicitly? It seems too implicit, especially since it adds _input_kwargs at runtime, which does not seem like good practice.
  4. Is there any documentation about the right way to build a custom Transformer? Everywhere I look is either Stack Overflow or personal blogs; there doesn't seem to be an official guideline.
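To make question 1 concrete, what I would have expected is roughly the following (a purely hypothetical sketch reusing the imports from the snippet above, not how PySpark is actually written):

class HasInputColWithSetter(HasInputCol):
    # Hypothetical mixin: the setter lives next to the getter in an ancestor class.
    def setInputCol(self, value):
        return self._set(inputCol=value)

class FixedInputTransformer(Transformer, HasInputColWithSetter):
    # A transformer that wants inputCol to be read-only simply overrides the setter.
    def setInputCol(self, value):
        raise NotImplementedError("inputCol is read-only for this transformer")

    def _transform(self, dataset):
        return dataset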

My first impression is that this is an effect of porting a Java framework like Spark to Python while trying to keep the API as close to the original as possible, hence, for example, the unconventional camelCase. But I am interested, from a design perspective, in whether this was preferred over a more Pythonic approach (using Protocols, properties, and descriptors, for example), such as the sketch below.
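Purely to illustrate what I mean by "more Pythonic", here is a sketch that deliberately ignores Pipeline and persistence integration, so it is not a drop-in replacement:

from dataclasses import dataclass

from pyspark.sql import DataFrame

@dataclass
class PythonicColumnDuplicator:
    # Plain attributes and type hints instead of Param objects and generated getters/setters.
    input_col: str
    output_col: str

    def transform(self, dataset: DataFrame) -> DataFrame:
        return dataset.withColumn(self.output_col, dataset[self.input_col])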
