I'm generating historical features for the current row with featuretools. For example, the number of transactions made in the last hour during a session.
The featuretools package includes a cutoff_time parameter that excludes all rows that come after cutoff_time.
I set cutoff_time to the time_index value minus 1 second, so I expect the features to be based on historical data, excluding the current row. This allows including the response variable from historical rows.
The problem is that when this parameter does not equal the time_index variable, I get a bunch of NaNs in both the original and the generated features.
Example:
#!/usr/bin/env python3
import featuretools as ft
import pandas as pd
from featuretools import primitives, variable_types

data = ft.demo.load_mock_customer()
transactions_df = data['transactions']
transactions_df['cutoff_time'] = transactions_df['transaction_time'] - pd.Timedelta(seconds=1)

es = ft.EntitySet('transactions_set')
es.entity_from_dataframe(
    entity_id='transactions',
    dataframe=transactions_df,
    variable_types={
        'transaction_id': variable_types.Index,
        'session_id': variable_types.Id,
        'transaction_time': variable_types.DatetimeTimeIndex,
        'product_id': variable_types.Id,
        'amount': variable_types.Numeric,
        'cutoff_time': variable_types.Datetime
    },
    index='transaction_id',
    time_index='transaction_time'
)
es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='sessions',
    index='session_id'
)
es.add_last_time_indexes()

fm, features = ft.dfs(
    entityset=es,
    target_entity='transactions',
    agg_primitives=[primitives.Sum, primitives.Count],
    trans_primitives=[primitives.Day],
    cutoff_time=transactions_df[['transaction_id', 'cutoff_time']].
        rename(index=str, columns={'transaction_id': 'transaction_id', 'cutoff_time': 'time'}),
    training_window='1 hours',
    verbose=True
)
print(fm)
Output (excerpt):
                DAY(cutoff_time)  sessions.SUM(transactions.amount)  \
transaction_id
352                          NaN                                NaN
186                          NaN                                NaN
319                          NaN                                NaN
256                          NaN                                NaN
449                          NaN                                NaN
40                           NaN                                NaN
13                           NaN                                NaN
127                          NaN                                NaN
21                           NaN                                NaN
309                          NaN                                NaN
The column sessions.SUM(transactions.amount) is supposed to be >= 0. The original features session_id, product_id, and amount are all NaN as well.
If transactions_df['cutoff_time'] = transactions_df['transaction_time'] (no time delta), this code works but includes the current row.
What is the right way to calculate aggregates and transformations that would exclude the current row from calculations?
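To make the expected result concrete, here is a minimal plain-pandas sketch (not featuretools) of the per-row aggregates I'm after: for each transaction, the sum and count of earlier transactions in the same session within the preceding hour, with the current row excluded. The toy data and the helper name prior_window_stats are made up for illustration, and the quadratic loop is only meant to pin down the semantics, not to be efficient.

```python
import pandas as pd

# Toy data, invented for illustration (not the featuretools mock dataset).
df = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "session_id": [1, 1, 1, 2, 2],
    "transaction_time": pd.to_datetime([
        "2014-01-01 00:00:00",
        "2014-01-01 00:10:00",
        "2014-01-01 01:30:00",
        "2014-01-01 00:05:00",
        "2014-01-01 00:20:00",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
})


def prior_window_stats(frame, window=pd.Timedelta(hours=1)):
    """Hypothetical helper: aggregate strictly earlier rows of the same
    session that fall inside [t - window, t), where t is the row's time."""
    rows = []
    for row in frame.itertuples(index=False):
        mask = (
            (frame["session_id"] == row.session_id)
            # Strict "<" is what excludes the current row.
            & (frame["transaction_time"] < row.transaction_time)
            & (frame["transaction_time"] >= row.transaction_time - window)
        )
        prior = frame.loc[mask, "amount"]
        rows.append({
            "transaction_id": row.transaction_id,
            "prev_sum": prior.sum(),   # 0.0 when there are no prior rows
            "prev_count": len(prior),
        })
    return pd.DataFrame(rows).set_index("transaction_id")


features = prior_window_stats(df)
print(features)
```

In this toy data, transaction 2 picks up only transaction 1 (sum 10.0, count 1), while transaction 3 gets a sum of 0.0 and a count of 0 because the earlier rows in its session are more than an hour old. This is the behaviour I would like the sessions.SUM(transactions.amount) and COUNT features to have.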