How can I use my own custom function in an sk-learn pipeline?

Question

I'm new to scikit-learn pipelines and would like to use my own form of discretized binning. I need to bin a column of values based on the cumulative sum of another column associated with it. I have a working function:

def dynamic_bin(df, column, weight, minimum):
    """
    

    Parameters
    ----------
    df : dataframe
    column : column to be binned
    weight : column that will dictate the bin
    minimum : minimum weight per bin

    Returns
    -------
    df : dataframe with new binned column

    """
    bins = [-np.inf]
    labels = [] 
    hold_over = []
    for i in sorted(df[column].unique()):
        g = df[df[column] == i].groupby(column).agg({weight:'sum'}).reset_index()
        
        if g[weight].values[0] < minimum:
            if hold_over is None:
                hold_over.append(g[weight].values[0])
                
            elif (sum(hold_over) + g[weight].values[0]) < minimum:
                hold_over.append(g[weight].values[0])
 
                
            elif (sum(hold_over) + g[weight].values[0]) >= minimum:
                hold_over.clear()
                bins.append(g[column].values[0])
                labels.append(g[column].values[0])
                
            
        elif g[weight].values[0] >= minimum:
            bins.append(g[column].values[0])
            labels.append(g[column].values[0])
    
    bins.pop()
    bins.append(np.inf)
    
    
    str_column = str(column)+str("_binned")
    # print(str_column)
    df[str_column] = pd.cut(df[column],
                            bins = bins,
                            labels = labels)
    

    return df

This is how I tried to make it a class.

from sklearn.base import  BaseEstimator, TransformerMixin

class dynamic_bin(BaseEstimator, TransformerMixin):
    def __init__(self, weight, minimum):
        self.weight = weight
        self.minimum = minimum
    def fit(self, X, y=None):
        return self
    def tranform(self, X):
        """
    

        Parameters
        ----------
        df : dataframe
        column : column to be binned
        weight : column that will dictate the bin
        minimum : minimum weight per bin
    
        Returns
        -------
        df : dataframe with new binned column
    
        """
        bins = [-np.inf]
        labels = [] 
        hold_over = []
        for i in sorted(df[column].unique()):
            g = df[df[column] == i].groupby(column).agg({weight:'sum'}).reset_index()
            
            if g[weight].values[0] < minimum:
                if hold_over is None:
                    hold_over.append(g[weight].values[0])
                    
                elif (sum(hold_over) + g[weight].values[0]) < minimum:
                    hold_over.append(g[weight].values[0])
     
                    
                elif (sum(hold_over) + g[weight].values[0]) >= minimum:
                    hold_over.clear()
                    bins.append(g[column].values[0])
                    labels.append(g[column].values[0])
                    
                
            elif g[weight].values[0] >= minimum:
                bins.append(g[column].values[0])
                labels.append(g[column].values[0])
        
        bins.pop()
        bins.append(np.inf)
        
        
        str_column = str(column)+str("_binned")
        # print(str_column)
        df[str_column] = pd.cut(df[column],
                                bins = bins,
                                labels = labels)
        
    
        return df[str_column]

When I try to use it as follows, I get the error shown underneath:

column_trans = ColumnTransformer(
    [
        ("binned_numeric", dynamic_bin(weight = 'Exposure', minimum = 1000),
            ["VehAge", "DrivAge"]),
        ("onehot_categorical", OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"]),
        ("passthrough_numeric", "passthrough",
            ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer,
            ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'dynamic_bin(minimum=1000, weight='Exposure')' (type <class 'dynamic_bin.dynamic_bin'>) doesn't.

I read the following but I don't really follow it.
Put customized functions in Sklearn pipeline

Does anyone spot the mistake I've made?

The question you linked _is_ the answer. To use a function in a pipeline, you need it to implement `.fit()` and `.transform()`. That question shows how to inherit from the base classes provided by sklearn to make an easy class wrapper for the pipeline to utilize the function(s) in question — G. Anderson, May 24 '21 at 18:36
It looks like you're fitting bins to the data, then returning the bins or binned data. For a built-in example, see [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) — G. Anderson, May 24 '21 at 18:42

score 0 · Accepted Answer · answered May 25 '21 at 13:38

The error itself is due to a typo in your method declaration. You implemented a method called tranform (note the missing 's') in your custom transformer class. That is why ColumnTransformer complains that your custom transformer does not implement transform.
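A quick way to see the effect of the typo is to check which attributes the class actually exposes. This is a minimal sketch; the class names MisspelledTransformer and CorrectTransformer are made up for illustration, and it assumes (as the error message suggests) that ColumnTransformer looks for a method named transform on each step:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class MisspelledTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical class that repeats the typo from the question."""

    def fit(self, X, y=None):
        return self

    def tranform(self, X):  # typo: missing 's'
        return X


class CorrectTransformer(BaseEstimator, TransformerMixin):
    """Same class with the method name spelled correctly."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# TransformerMixin contributes fit_transform, but not transform itself,
# so the misspelled class has no attribute called "transform".
has_transform_bad = hasattr(MisspelledTransformer(), "transform")
has_transform_good = hasattr(CorrectTransformer(), "transform")
print(has_transform_bad)   # False
print(has_transform_good)  # True
```

Inheriting from TransformerMixin does not help here, because the mixin builds fit_transform on top of a transform method that you have to supply yourself.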

While this will be a simple fix, you should also be aware that you have not adjusted your custom function to be used in the class you defined. For example:

- the variable df should be renamed to X
- weight and minimum are now object attributes and need to be referenced as self.weight and self.minimum
- the variable column is undeclared

You will need to fix these issues as well. In regard to this, be aware that ColumnTransformer will only pass the subset of columns to the transformer that is meant to be transformed by this particular transformer. That means if you only pass the columns VehAge and DrivAge to dynamic_bin it cannot access the column Exposure.
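To check the column subsetting for yourself, a small probe transformer can record what it receives. This is only a sketch: ColumnRecorder and the toy dataframe are invented for illustration and are not part of your code or of scikit-learn:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class ColumnRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical probe: remembers which columns were handed to fit."""

    def fit(self, X, y=None):
        self.seen_columns_ = list(X.columns)
        return self

    def transform(self, X):
        return X


# Toy data that only borrows the column names from the question.
df = pd.DataFrame(
    {
        "VehAge": [1, 2, 3],
        "DrivAge": [30, 40, 50],
        "Exposure": [0.5, 1.0, 0.2],
    }
)

ct = ColumnTransformer(
    [("rec", ColumnRecorder(), ["VehAge", "DrivAge"])],
    remainder="drop",
)
ct.fit_transform(df)

# ColumnTransformer fits a clone, so read the fitted copy back out.
seen = ct.named_transformers_["rec"].seen_columns_
print(seen)  # ['VehAge', 'DrivAge']
```

Exposure never reaches the transformer, so any lookup of X["Exposure"] inside it would fail with a KeyError.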

Thank you. Your answer makes sense. Does the last point about not being able to access 'Exposure' mean it's not possible? — Jordan, May 25 '21 at 14:46
There are many workarounds to solve this. One solution would be to pass the 'Exposure' column to your custom transformer as well, i.e. to specify it in the `ColumnTransformer`'s step along 'VehAge' and 'DrivAge'. Then you can perform your transformation based on 'Exposure'. If you then want to remove 'Exposure' from the final dataframe, you'd drop it within your `transform` method before you return it. — afsharov, May 25 '21 at 15:20
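The workaround from the comment above could look roughly like the sketch below. Please treat it as an illustration under assumptions rather than a drop-in replacement: DynamicBinner and edges_ are hypothetical names, the toy data is invented, and the accumulation rule is a simplified one (close a bin as soon as the running weight reaches the minimum, and merge any leftover into the last bin), not a line-by-line port of the function in the question. It also learns the bin edges in fit and only applies them in transform, which is one possible design choice.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class DynamicBinner(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: bins hold at least `minimum` total weight."""

    def __init__(self, weight, minimum):
        self.weight = weight
        self.minimum = minimum

    def fit(self, X, y=None):
        self.edges_ = {}
        # Every column except the weight column gets its own bin edges.
        for column in X.columns.drop(self.weight):
            totals = X.groupby(column)[self.weight].sum().sort_index()
            edges, running = [-np.inf], 0.0
            for value, total in totals.items():
                running += total
                if running >= self.minimum:
                    edges.append(value)  # close the current bin here
                    running = 0.0
            # Open the last bin to +inf; leftover weight is merged into it.
            if len(edges) > 1:
                edges[-1] = np.inf
            else:
                edges.append(np.inf)
            self.edges_[column] = edges
        return self

    def transform(self, X):
        # Only the binned columns are returned, so the weight column is dropped.
        out = pd.DataFrame(index=X.index)
        for column, edges in self.edges_.items():
            out[column + "_binned"] = pd.cut(X[column], bins=edges, labels=False)
        return out


# Toy data that only borrows the column names from the question.
df = pd.DataFrame(
    {
        "VehAge": [0, 1, 2, 3, 4],
        "DrivAge": [20, 30, 30, 40, 50],
        "Exposure": [600, 500, 300, 900, 200],
    }
)

column_trans = ColumnTransformer(
    [
        ("binned_numeric", DynamicBinner(weight="Exposure", minimum=1000),
            ["VehAge", "DrivAge", "Exposure"]),  # weight column passed along
    ],
    remainder="drop",
)
binned = column_trans.fit_transform(df)
print(binned.shape)  # (5, 2)
```

Because Exposure is listed in the step's columns, the transformer can read it, and because transform returns only the *_binned columns, Exposure does not show up in the output.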
