
I want to select columns based on their datetime data types. My DataFrame has, for example, columns with the types np.dtype('datetime64[ns]'), np.datetime64, and 'datetime64[ns, UTC]'.

Is there a generic way to select all columns with a datetime datatype?


For instance, this works:

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=(np.dtype('datetime64[ns]'),np.datetime64))
selected_columns = selector(df)

But this doesn't (the dtype comes from a pandas DataFrame with a 'UTC' timezone):

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=np.dtype('datetime64[ns, UTC]'))
selected_columns = selector(df)

This is in contrast to numeric data types, where you can simply use np.number instead of np.int64 etc.

API-Reference to make_column_selector: LINK

JAdel

2 Answers


As far as I know, there is no parent class that catches all datetime columns in a generic way, analogous to np.number as you pointed out.

My guess is that you want to use it in a scikit-learn pipeline for preprocessing. In that case, you can use a custom selector:

# Custom function to check if dtype is a datetime
def is_datetime(dtype):
    dtype_str = str(dtype)
    # Check for timezone-aware and timezone-naive datetimes
    if 'datetime64[ns' in dtype_str or np.issubdtype(dtype, np.datetime64):
        return True
    return False

# Custom function to mimic make_column_selector with custom type checks
def custom_column_selector(dtype_checker):
    def selector(df):
        return [col for col, dtype in df.dtypes.items() if dtype_checker(dtype)]
    return selector
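Side note: pandas itself ships a dtype check that covers both timezone-naive and timezone-aware datetime columns, so the string comparison above could be replaced with it (a sketch using the public `pd.api.types` helper):

```python
import pandas as pd

# Equivalent check using pandas' built-in helper:
# is_datetime64_any_dtype returns True for both tz-naive
# ('datetime64[ns]') and tz-aware ('datetime64[ns, UTC]') dtypes
def is_datetime(dtype):
    return pd.api.types.is_datetime64_any_dtype(dtype)
```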

Test

This is a simple test of the custom selector on a dummy DataFrame with several different data types, including both timezone-aware and timezone-naive dates.

# Imports
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.compose import make_column_selector

# Config
set_config(transform_output="pandas")

# Example DataFrame with mixed dtypes and different datetime types
df = pd.DataFrame({
    'a': range(10),
    'b': pd.to_datetime(range(10), unit='D', origin=pd.Timestamp('2023-01-01')),
    'c': pd.to_datetime(range(10), utc=True),
    'd': pd.Series(pd.date_range("2023-01-01", periods=10)).dt.tz_localize('America/New_York'),
    'e': pd.Series(pd.date_range("2023-01-01", periods=10)).dt.tz_localize('Europe/London'),
    'f': ['cat', 'dog']*5,
    'g': [1.1, 2.2]*5
})

# Custom function to check if dtype is a datetime
def is_datetime(dtype):
    dtype_str = str(dtype)
    # Check for timezone-aware and timezone-naive datetimes
    if 'datetime64[ns' in dtype_str or np.issubdtype(dtype, np.datetime64):
        return True
    return False

# Define the datetime transformer; replace it with whichever you prefer
def extract_year(df):
    return df.apply(lambda x: x.dt.year)

datetime_transformer = Pipeline(steps=[
    ('year', FunctionTransformer(extract_year, validate=False))
])

# Custom column selector function
def custom_column_selector(dtype_checker):
    def selector(df):
        return [col for col, dtype in df.dtypes.items() if dtype_checker(dtype)]
    return selector

# Create preprocessor with transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(sparse_output=False), make_column_selector(dtype_include=object)),
        ('datetime', datetime_transformer, custom_column_selector(is_datetime))
    ])

# Usage of the preprocessor
data_transformed = preprocessor.fit_transform(df)

After preprocessing:

   num__a  num__g  cat__f_cat  cat__f_dog  datetime__b  datetime__c  datetime__d  datetime__e
-1.5667       -1           1           0         2023         1970         2023         2023
-1.21854       1           0           1         2023         1970         2023         2023
-0.870388     -1           1           0         2023         1970         2023         2023
-0.522233      1           0           1         2023         1970         2023         2023
-0.174078     -1           1           0         2023         1970         2023         2023
 0.174078      1           0           1         2023         1970         2023         2023
 0.522233     -1           1           0         2023         1970         2023         2023
 0.870388      1           0           1         2023         1970         2023         2023
 1.21854      -1           1           0         2023         1970         2023         2023
 1.5667        1           0           1         2023         1970         2023         2023
DataJanitor
  • Hey @DataJanitor, thank you for your answer! Is there also a way where you don't need to create a custom selector and can instead check the dtypes inside the `make_column_selector` function itself? – JAdel Aug 30 '23 at 11:33
  • 1
    You're welcome, Jörg. As far as I know, unfortunately, there isn't as functions like my `is_datetime` cannot be passed to `make_column_selector`. I also thoght of using the `pattern` parameter (like this `r'datetime64\[ns(, .+)?\]'`), but it refers to the column name, not the dtype of the column. So just in case your datetime columns have a naming pattern, you might want to try the `pattern` parameter. – DataJanitor Aug 30 '23 at 12:33
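To illustrate the `pattern` workaround from the comment above: it matches column names, so it only helps if the datetime columns follow a naming convention (the `_at` suffix here is a made-up example):

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical naming convention: datetime columns end in "_at"
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "updated_at": pd.to_datetime(["2023-01-01", "2023-01-02"], utc=True),
    "amount": [1.0, 2.0],
})

# `pattern` is matched against the column *names*, not their dtypes
selector = make_column_selector(pattern=r"_at$")
print(selector(df))  # ['created_at', 'updated_at']
```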

make_column_selector calls pandas.DataFrame.select_dtypes under the hood; if you check the API page:

To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'

So you can try:

df = pd.DataFrame({
    'a':range(10),
    'b':pd.to_datetime(range(10)),
    'c':pd.to_datetime(range(10),utc=True)
})
df.dtypes

a                  int64
b         datetime64[ns]
c    datetime64[ns, UTC]

from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include='datetime64[ns, UTC]')
selector(df)
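Building on that, `select_dtypes` also accepts the generic strings `'datetime'` (tz-naive) and `'datetimetz'` (tz-aware), so combining both should select every datetime column regardless of its timezone, without spelling out each dtype (a sketch based on the `select_dtypes` docs):

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    "a": range(10),
    "b": pd.to_datetime(range(10)),
    "c": pd.to_datetime(range(10), utc=True),
})

# 'datetime' matches tz-naive columns, 'datetimetz' matches tz-aware ones
selector = make_column_selector(dtype_include=["datetime", "datetimetz"])
print(selector(df))  # ['b', 'c']
```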
StupidWolf