Custom transformer that splits dates into new column

Question

I am following the sklearn_pandas walk through found on the sklearn_pandas README on github and am trying to modify the DateEncoder() custom transformer example to do 2 additional things:

Convert string type columns to datetime while taking the date format as a parameter
Append the original column names when spitting out the new columns. E.g: if Input Column: Date1 then Outputs: Date1_year, Date1_month, Date_1 day.

Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper

class DateEncoder(TransformerMixin):

    '''
    Specify date format using python strftime formats
    '''

    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)


data = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23', 
                                 '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],   
                     'dates2'  : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23', 
                                 '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})

DATE_COLS = ['dates1', 'dates2']

Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)

But on runtime, I get the following error:

AttributeError: Can only use .dt accessor with datetimelike values

Why am I getting this error and how to fix it? Also any help with renaming the column names as mentioned above with the original columns (Date1_year, Date1_month, Date_1 day) would be greatly appreciated!

You convert `X` to datetime at `self.dt` in `fit`, but `transform()` isn't working with `self.dt`. `X.dt` fails because `X` isn't of type datetime. — andrew_reece, Oct 15 '17 at 20:25

edesz · Answer 1 · 2020-06-10T15:48:11.480

I know this is late, but if you're still interested in a way to do this while renaming the columns with the custom transformer...

I used the approach of adding the method get_feature_names to the custom transformer inside a pipeline with the ColumnTransformer (overview). You can then use the .named_steps attribute to access the pipeline's step and then get to get_feature_names and then get the column_names, which ultimately holds the names of the custom column names to be used. This way you can retrieve column names similar to the approach in this SO post.

I had to run this with a pipeline because when I attempted to do it as a standalone custom transformer it went badly wrong (so I won't post that incomplete attempt here) - though you may have better luck.

Here is the raw code showing the pipeline

import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


data2 = pd.DataFrame(
    {"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
        "2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"
    ], "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
        "2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})

DATE_COLS = ['dates1', 'dates2']

pipeline = Pipeline([
    ('transform', ColumnTransformer([
        ('datetimes', Pipeline([
            ('formatter', DateFormatter()), ('encoder', DateEncoder()),
        ]), DATE_COLS),
    ])),
])

data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
    pipeline.named_steps['transform']
    .named_transformers_['datetimes']
    .named_steps['encoder']
    .get_feature_names()
)
data3.columns = data3_names

print(data2)
print(data3)

The output is

       dates1      dates2
0  2001-12-20  2012-12-20
1  2002-10-21  2009-10-21
2  2003-08-22  2016-08-22
3  2004-08-23  2017-08-23
4  2004-07-20  2014-07-20
5  2007-12-21  2011-12-21
6  2006-12-22  2014-12-22
7  2003-04-23  2015-04-23
   dates1_year  dates1_month  dates1_day  dates2_year  dates2_month  dates2_day
0         2001            12          20         2012            12          20
1         2002            10          21         2009            10          21
2         2003             8          22         2016             8          22
3         2004             8          23         2017             8          23
4         2004             7          20         2014             7          20
5         2007            12          21         2011            12          21
6         2006            12          22         2014            12          22
7         2003             4          23         2015             4          23

The custom transformers are here (skipping DateFormatter, since it is identical to yours)

class DateEncoder(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column+'_'+col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt

    def get_feature_names(self):
        # Flatten list of column names
        self.column_names = [c for sublist in self.column_names for c in sublist]
        return self.column_names

Rationale for DateEncoder

The loop over pandas columns allows the datetime attributes to be extracted from each datetime column. In the same loop, the custom column names are constructed. These are then added to a blank list under self.column_names which is returned in the method get_feature_names (though it has to be flattened before assigning to a dataframe).

For this particular case, you could potentially skip sklearn_pandas.

Details

sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1

Thanks for sharing. Unfortunately, I don't use pipelines anymore. I found it was easier just to chain functions...less overhead building those transformers. But this is a good approach nonetheless. — Jason, Oct 06 '18 at 22:34

score 2 · Accepted Answer · answered Oct 16 '17 at 03:43

I was able to break the data format conversion and date splitter into two separate transformers and it worked.

import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper



data2 = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23', 
                                 '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],   
                     'dates2'  : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23', 
                                 '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})

class DateFormatter(TransformerMixin):

    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate


class DateEncoder(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)


DATE_COLS = ['dates1', 'dates2']

datemult = DataFrameMapper(
            [ (i,[DateFormatter(),DateEncoder()]) for i in DATE_COLS     ] 
            , input_df=True, df_out=True)

df = datemult.fit_transform(data2)

This code outputs:

Out[4]: 
   dates1_0  dates1_1  dates1_2  dates2_0  dates2_1  dates2_2
0      2001        12        20      2012        12        20
1      2002        10        21      2009        10        21
2      2003         8        22      2016         8        22
3      2004         8        23      2017         8        23
4      2004         7        20      2014         7        20
5      2007        12        21      2011        12        21
6      2006        12        22      2014        12        22
7      2003         4        23      2015         4        23

However I am still looking for a way to rename the new columns while applying the DateEncoder() transformer. E.g: dates_1_0 --> dates_1_year and dates_2_2 --> dates_2_month. I'd be happy to select that as the solution.

Custom transformer that splits dates into new column

2 Answers2