3

Im using featuretools and I need to create a feature that uses the cutoff time for its calculation.

My entityset consist in a client table and a subscription table (it has more but for the question only these are necessary):

import featuretools as ft
import pandas as pd


client_table = pd.DataFrame({'client_id': (1,2,3), 
                             'start_date': (dt.date(2015,1,1),dt.date(2017,10,15),dt.date(2011,1,10))})

subscription_table = pd.DataFrame({'client_id': (1,3,1,2), 
                             'start_plan_date': (dt.date(2015,1,1),dt.date(2011,1,10), dt.date(2018,2,1),dt.date(2017,10,15)), 
                             'end_plan_date':(dt.date(2018,2,1),dt.date(2019,1,10), dt.date(2021,2,1),dt.date(2019,10,15))})

client table

client_id   start_date
0   1       2015-01-01
1   2       2017-10-15
2   3       2011-01-10

substription table

subscription_id client_id   start_plan_date end_plan_date
0   1   1   2015-01-01      2018-02-01
1   2   3   2011-01-10      2019-01-10
2   3   1   2018-02-01      2021-02-01
3   4   2   2017-10-15      2019-10-15

I created the entity set using client_id as key and setting start_date as time_index

es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id="client", 
                              dataframe=client_table,
                              index="client_id",
                              time_index="start_date")

es = es.entity_from_dataframe(entity_id="subscription", 
                              dataframe=subscription_table,
                              index="subscription_id",
                              time_index="start_plan_date", 
                              variable_types={"client_id": ft.variable_types.Index,
                                             "end_plan_date": ft.variable_types.Datetime})

relation= ft.Relationship(es["client"]["client_id"],es["subscription"]["client_id"])

es = es.add_relationship(relation)
print(es)

Out:

Entityset: None
  Entities:
    subscription [Rows: 4, Columns: 4]
    client [Rows: 3, Columns: 2]
  Relationships:
    subscription.client_id -> client.client_id

Now, I need to create a feature that estimates the time between the cutoff time (i.e. 01/01/2018) and the closest end_plan_date for each client. In algebraic form the calculation should be

time_remaining_in_plan = max(subscription.end_plan_date - cutoff_time)

Also I need to calculate the amount of time since the client started:

time_since_start = cutoff_time - client.start_date

In my example the expected output for those features should look like this (im assuming the time differences in days, but it could be months also, also im using a time range for the cutoff times):

client_id cutoff_time time_remaining_in_plan  time_since_start
0   3     2018-10-31  71                      2851
1   3     2018-11-30  41                      2881
2   1     2018-10-31  824                     1399
3   1     2018-11-30  794                     1429
4   2     2018-10-31  349                     381
5   2     2018-11-30  319                     411

Is there a way to use featuretools to create custom primitives (aggregation or transformation) or seed features that can generate this result??

Thanks!!

1 Answers1

4

This can be done with custom primitives that use the use_calc_time parameter. This parameter will set up the primitive such that the cutoff time gets passed to it during calculation.

In your case, we need to define two primitives

from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Datetime, Numeric


def time_until(array, time):
    diff = pd.DatetimeIndex(array) - time
    return diff.days


TimeUntil = make_trans_primitive(function=time_until,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time until the cutoff time in days",
                                 name="time_until")


def time_since(array, time):
    diff = time - pd.DatetimeIndex(array)
    return diff.days


TimeSince = make_trans_primitive(function=time_since,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time since the cutoff time in days",
                                 name="time_since")

Then we can use the primitives in a call to ft.dfs

cutoff_times = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3, 3],
    "cutoff_time": pd.to_datetime([dt.date(2018,10,31), dt.date(2018,11,30)]*3)
    })

fm, fl = ft.dfs(entityset=es,
                target_entity="client",
                cutoff_time=cutoff_times,
                agg_primitives=["max"],
                trans_primitives=[TimeUntil, TimeSince],
                cutoff_time_in_index=True)


# these columns correspond to time_remaining_in_plan and time_since_start
fm = fm[["MAX(subscription.TIME_UNTIL(end_plan_date))", "TIME_SINCE(start_date)"]]

this returns

                      MAX(subscription.TIME_UNTIL(end_plan_date))  TIME_SINCE(start_date)
client_id time                                                                           
1         2018-10-31                                         -272                    1399
2         2018-10-31                                          349                     381
3         2018-10-31                                           71                    2851
1         2018-11-30                                         -302                    1429
2         2018-11-30                                          319                     411
3         2018-11-30                                           41                    2881

This matches the result you're looking for in your answer with the exception of time_remaining_in_plan for client id 1. I double checked the numbers Feauturetools came up and I believe they are right for this dataset.

Max Kanter
  • 2,006
  • 6
  • 16
  • 1
    Thank you Max, that's exactly what I was looking for! The data for client 1 was wrong (I had two subscriptions for client 1 with the same end date). I'm changing it now on the question – Jose Manuel Peña Mendez Dec 10 '18 at 23:09