Excluding the current row from feature engineering in Python featuretools

Question

I'm generating historical features for the current row with featuretools. For example, the number of transactions made in the last hour during a session.

Package featuretools includes parameter cutoff_time to exclude all rows that come after cutoff_time in time.

I set cutoff_time as time_index value - 1 second, so I expect the features to be based on historical data minus the current row. This allows including the response variable from historical rows.

The problem is, when this parameter does not equal the time_index variable, I get a bunch of NaNs in the original and generated features.

Example:

#!/usr/bin/env python3

import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()

transactions_df = data['transactions']
transactions_df['cutoff_time'] = transactions_df['transaction_time'] - pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'cutoff_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time'
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    cutoff_time=transactions_df[['transaction_id', 'cutoff_time']].
        rename(index=str, columns={'transaction_id': 'transaction_id', 'cutoff_time': 'time'}),
    training_window='1 hours',
    verbose=True
)

print(fm)

Output (excerpt):

                DAY(cutoff_time)  sessions.SUM(transactions.amount)  \
transaction_id                                                        
352                          NaN                                NaN   
186                          NaN                                NaN   
319                          NaN                                NaN   
256                          NaN                                NaN   
449                          NaN                                NaN   
40                           NaN                                NaN   
13                           NaN                                NaN   
127                          NaN                                NaN   
21                           NaN                                NaN   
309                          NaN                                NaN

Column sessions.SUM(transactions.amount) is supposed to be >= 0. Original features session_id product_id amount are all NaN as well.

If transactions_df['cutoff_time'] = transactions_df['transaction_time'] (no time delta), this code works but includes the current row.

What is the right way to calculate aggregates and transformations that would exclude the current row from calculations?

score 2 · Accepted Answer · answered Jul 05 '18 at 20:36

What you are seeing is the intended behavior of cutoff times and time_index. The time_index of an entity represents the first time that any information can be known for each instance. When you provide a cutoff time to Featuretools, it simulates the state of the dataset at the dataset by removing rows where the time index is after the cutoff time.

In this case, the transaction_id and session_id for a transaction are unknown prior to the transaction time, which makes sense because the transaction hasn’t occurred. That is why when you ask Featuretools to calculate features a second before the transaction time, it returns NaN for all the features.

The way to handle this is to assign a secondary_time_index to variables like amount in transactions. This is described in the advanced solution of this Stack Overflow answer. This allows you to tell Featuretools that a particular variable isn't valid for use at the transaction_time and can only be used at the time in your secondary time index column. Essentially, you would be blocking some of the row values from being used at the transaction time, while allowing other values. You can assign a secondary time index to any number of variables in that entity.

score 1 · Answer 2 · answered Jul 06 '18 at 09:54

Based on Max Kanter's answer:

#!/usr/bin/env python3

import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()

transactions_df = data['transactions']
transactions_df['response_time'] = transactions_df['transaction_time'] + pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'response_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time',
    secondary_time_index={'response_time': ['amount', 'transaction_id']}
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    cutoff_time=transactions_df[['transaction_id', 'transaction_time']],
    cutoff_time_in_index=True,
    training_window='5 minutes',
    verbose=True
)

print(fm)

This code produces features sessions.SUM(transactions.amount) and sessions.COUNT(transactions) that exclude the current row and include all the previous rows less than 5 minutes old.

Excluding the current row from feature engineering in Python featuretools

2 Answers2