Using multiple parent IDs for cutoff times in deep feature synthesis

Question

My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities each have a time index, along with event_id and activity_id respectively, and each has a few features.

Members of the 'People' entity visit places at many different times. I am trying to generate deep features for people. If People is something like [1, 2, 3], how do I pass cutoff times that create deep features for (person, cutoff time) pairs such as [1, January 2], [1, January 3]?

If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.

Must I include a time index in the People entity? That would leave my parent entity with the same person appearing multiple times in the index, although with different time index values. My instinct is that the People entity should not include any datetime column. I would like to give cutoff times to the DFS function.

My cutoff_times df.head() looks like this, and has multiple instances of some person_id values:

+-------------------------------------------+
|         person_id       time        label |
+-------------------------------------------+
| 0      f_GZSVLYU 2019-12-06           0.0 |
| 1      f_ATBJEQS 2019-12-06           1.0 |
| 2      f_GLFYVAY 2019-12-06           0.5 |
| 3      f_DIHPTPA 2019-12-06           0.5 |
| 4      f_GZSVLYU 2019-12-02           1.0 |
+-------------------------------------------+

The Parent People Entity is like this:

+-------------------+
|       person_id   |
+-------------------+
| 0      f_GZSVLYU  |
| 1      f_ATBJEQS  |
| 2      f_GLFYVAY  |
| 3      f_DIHPTPA  |
| 4      f_DVOYHRQ  |
+-------------------+

How can I make featuretools understand what I'm trying to do?

The error is 'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person IDs, times, and labels each occur multiple times, but no two rows are the same. Could the duplicates the error refers to be somewhere else in the EntitySet?

score 3 · Answer 1 · answered Jan 05 '20 at 20:25

The answer is that two rows of cutoff_df had the same ID and time but different labels. That counts as a duplicate, even though the labels differ.
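For anyone hitting the same error, here is a minimal pandas sketch of how such rows can be found. The data is made up to mirror the layout in the question (column names person_id, time, label are taken from it); it only illustrates why drop_duplicates() did not help and is not how featuretools itself performs the check.

```python
import pandas as pd

# Hypothetical cutoff times in the same layout as the question.
# Rows 0 and 3 share person_id and time but carry different labels.
cutoff_times = pd.DataFrame({
    "person_id": ["f_GZSVLYU", "f_ATBJEQS", "f_GZSVLYU", "f_GZSVLYU"],
    "time": pd.to_datetime(
        ["2019-12-06", "2019-12-06", "2019-12-02", "2019-12-06"]
    ),
    "label": [0.0, 1.0, 1.0, 0.5],
})

# drop_duplicates() compares every column, so rows that differ
# only in label are all kept.
fully_deduped = cutoff_times.drop_duplicates()

# Flag every row whose (person_id, time) pair occurs more than once,
# ignoring the label column.
conflict_mask = cutoff_times.duplicated(
    subset=["person_id", "time"], keep=False
)
conflicts = cutoff_times[conflict_mask]
print(conflicts)
```

If conflicts is non-empty, those are the rows to inspect before passing the dataframe as cutoff times.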

answered Jan 05 '20 at 20:25

Gabe


That's correct. In the cutoff times, each (ID, time) pair can appear in only one row. A person can have the same cutoff time as another person, but the cutoff times for an individual person must be unique. I’d suggest looking into [Compose](https://github.com/FeatureLabs/compose), which is ideal for automatically generating the cutoff times based on how you define the prediction problem. – Jeff Hernandez Jan 06 '20 at 14:57
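One possible way to enforce that rule by hand is sketched below, again with made-up data in the question's layout. Averaging the conflicting labels is purely an assumption for illustration; which label to keep depends on how the prediction problem is defined.

```python
import pandas as pd

# Hypothetical cutoff times; one (person_id, time) pair appears twice
# with different labels.
cutoff_times = pd.DataFrame({
    "person_id": ["f_GZSVLYU", "f_ATBJEQS", "f_GZSVLYU", "f_GZSVLYU"],
    "time": pd.to_datetime(
        ["2019-12-06", "2019-12-06", "2019-12-02", "2019-12-06"]
    ),
    "label": [0.0, 1.0, 1.0, 0.5],
})

# Assumption: collapse each (person_id, time) pair to a single row by
# averaging its labels. Another rule (first, last, max) may fit better.
unique_cutoffs = (
    cutoff_times
    .groupby(["person_id", "time"], as_index=False)["label"]
    .mean()
)

# Each person may still have many cutoff times, but no pair repeats.
has_repeats = unique_cutoffs.duplicated(subset=["person_id", "time"]).any()
print(unique_cutoffs)
print("repeated (person_id, time) pairs:", has_repeats)
```

The resulting unique_cutoffs still has several rows per person, which is fine; it is the dataframe you would then pass as the cutoff times.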