
I want to select columns based on their datetime data types. My DataFrame has, for example, columns with the types np.dtype('datetime64[ns]'), np.datetime64, and 'datetime64[ns, UTC]'.

Is there a generic way to select all columns with a datetime datatype?


For instance, this works:

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=(np.dtype('datetime64[ns]'),np.datetime64))
selected_columns = selector(df)

But this doesn't (the dtype comes from a pandas DataFrame with a 'UTC' timezone):

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=np.dtype('datetime64[ns, UTC]'))
selected_columns = selector(df)

This is in contrast to numeric data types, where you can simply use np.number instead of np.int64 etc.

API-Reference to make_column_selector: LINK

JAdel

2 Answers


As far as I know, there is no parent class that catches all datetime columns in a generic way, analogous to np.number as you pointed out.

My guess is that you want to use it in a scikit-learn pipeline for preprocessing. In that case, you can use a custom selector:

# Custom function to check if dtype is a datetime
def is_datetime(dtype):
    dtype_str = str(dtype)
    # Check for timezone-aware and timezone-naive datetimes
    if 'datetime64[ns' in dtype_str or np.issubdtype(dtype, np.datetime64):
        return True
    return False

# Custom function to mimic make_column_selector with custom type checks
def custom_column_selector(dtype_checker):
    def selector(df):
        return [col for col, dtype in df.dtypes.items() if dtype_checker(dtype)]
    return selector
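Side note: pandas itself ships a dtype check that covers both timezone-naive and timezone-aware datetime columns, so the string comparison above could be replaced with it (a sketch using the public `pd.api.types` helper):

```python
import pandas as pd

# Equivalent check using pandas' built-in helper:
# is_datetime64_any_dtype returns True for both tz-naive
# ('datetime64[ns]') and tz-aware ('datetime64[ns, UTC]') dtypes
def is_datetime(dtype):
    return pd.api.types.is_datetime64_any_dtype(dtype)
```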

Test

This is a simple test of the custom selector on a dummy DataFrame with several different data types, including both timezone-aware and timezone-naive dates.

# Imports
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.compose import make_column_selector

# Config
set_config(transform_output="pandas")

# Example DataFrame with mixed dtypes and different datetime types
df = pd.DataFrame({
    'a': range(10),
    'b': pd.to_datetime(range(10), unit='D', origin=pd.Timestamp('2023-01-01')),
    'c': pd.to_datetime(range(10), utc=True),
    'd': pd.Series(pd.date_range("2023-01-01", periods=10)).dt.tz_localize('America/New_York'),
    'e': pd.Series(pd.date_range("2023-01-01", periods=10)).dt.tz_localize('Europe/London'),
    'f': ['cat', 'dog']*5,
    'g': [1.1, 2.2]*5
})

# Custom function to check if dtype is a datetime
def is_datetime(dtype):
    dtype_str = str(dtype)
    # Check for timezone-aware and timezone-naive datetimes
    if 'datetime64[ns' in dtype_str or np.issubdtype(dtype, np.datetime64):
        return True
    return False

# Define the datetime transformer; replace it with whichever you prefer
def extract_year(df):
    return df.apply(lambda x: x.dt.year)

datetime_transformer = Pipeline(steps=[
    ('year', FunctionTransformer(extract_year, validate=False))
])

# Custom column selector function
def custom_column_selector(dtype_checker):
    def selector(df):
        return [col for col, dtype in df.dtypes.items() if dtype_checker(dtype)]
    return selector

# Create preprocessor with transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(sparse_output=False), make_column_selector(dtype_include=object)),
        ('datetime', datetime_transformer, custom_column_selector(is_datetime))
    ])

# Usage of the preprocessor
data_transformed = preprocessor.fit_transform(df)

After preprocessing:

   num__a  num__g  cat__f_cat  cat__f_dog  datetime__b  datetime__c  datetime__d  datetime__e
-1.5667       -1           1           0         2023         1970         2023         2023
-1.21854       1           0           1         2023         1970         2023         2023
-0.870388     -1           1           0         2023         1970         2023         2023
-0.522233      1           0           1         2023         1970         2023         2023
-0.174078     -1           1           0         2023         1970         2023         2023
 0.174078      1           0           1         2023         1970         2023         2023
 0.522233     -1           1           0         2023         1970         2023         2023
 0.870388      1           0           1         2023         1970         2023         2023
 1.21854      -1           1           0         2023         1970         2023         2023
 1.5667        1           0           1         2023         1970         2023         2023
DataJanitor
  • Hey @DataJanitor, thank you for your answer! Is there also a way where you don't need to create a custom selector and can instead check the dtypes inside the `make_column_selector` function itself? – JAdel Aug 30 '23 at 11:33
  • 1
    You're welcome, Jörg. As far as I know, unfortunately, there isn't as functions like my `is_datetime` cannot be passed to `make_column_selector`. I also thoght of using the `pattern` parameter (like this `r'datetime64\[ns(, .+)?\]'`), but it refers to the column name, not the dtype of the column. So just in case your datetime columns have a naming pattern, you might want to try the `pattern` parameter. – DataJanitor Aug 30 '23 at 12:33
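To illustrate the `pattern` workaround from the comment above: it matches column names, so it only helps if the datetime columns follow a naming convention (the `_at` suffix here is a made-up example):

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical naming convention: datetime columns end in "_at"
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "updated_at": pd.to_datetime(["2023-01-01", "2023-01-02"], utc=True),
    "amount": [1.0, 2.0],
})

# `pattern` is matched against the column *names*, not their dtypes
selector = make_column_selector(pattern=r"_at$")
print(selector(df))  # ['created_at', 'updated_at']
```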

make_column_selector calls pandas.DataFrame.select_dtypes under the hood; if you check the API page:

To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'

So you can try:

df = pd.DataFrame({
    'a':range(10),
    'b':pd.to_datetime(range(10)),
    'c':pd.to_datetime(range(10),utc=True)
})
df.dtypes

a                  int64
b         datetime64[ns]
c    datetime64[ns, UTC]

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include='datetime64[ns, UTC]')
selector(df)
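Building on that, `select_dtypes` also accepts the generic strings `'datetime'` (tz-naive) and `'datetimetz'` (tz-aware), so combining both should select every datetime column regardless of its timezone, without spelling out each dtype (a sketch based on the `select_dtypes` docs):

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    "a": range(10),
    "b": pd.to_datetime(range(10)),
    "c": pd.to_datetime(range(10), utc=True),
})

# 'datetime' matches tz-naive columns, 'datetimetz' matches tz-aware ones
selector = make_column_selector(dtype_include=["datetime", "datetimetz"])
print(selector(df))  # ['b', 'c']
```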
StupidWolf