5

I have a single dataframe and want to use Featuretools for automated feature engineering. I am able to do this with the normalize_entity function. The code snippet is below:

import featuretools as ft

# Build an entity set with a single entity holding the training data.
es = ft.EntitySet(id='obs_data')
es = es.entity_from_dataframe(entity_id='obs', dataframe=X_train,
                              variable_types=variable_types, make_index=True, index="Id")
# Split each interaction column (found using xgbfir) into its own entity.
for feat in interaction:
    es = es.normalize_entity(base_entity_id='obs', new_entity_id=feat, index=feat)
# Run deep feature synthesis with the original entity as the target.
features, feature_names = ft.dfs(entityset=es,
                                 target_entity='obs',
                                 max_depth=2)

This creates features. Now I want to do the same thing for X_test. The blogs I have read suggest combining X_train and X_test and then running the same process. But suppose there are 5 observations in X_test: if I combine them with X_train, each X_test observation will also be affected by the other 4 X_test observations, which is not a good idea. Can anyone suggest how to do feature engineering with Featuretools for new data?
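The concern can be made concrete with a small plain-Python sketch. It does not use Featuretools; the columns `store` and `amount` and the helper `group_mean` are hypothetical stand-ins for a normalized entity and an aggregate feature built over it. The same per-group mean is computed once on the combined data and once on the training rows alone:

```python
from collections import defaultdict


def group_mean(rows, key, value):
    """Return {group: mean of `value`} over the given rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical toy data; not taken from the question.
X_train = [
    {"store": "A", "amount": 10.0},
    {"store": "A", "amount": 20.0},
    {"store": "B", "amount": 5.0},
]
X_test = [
    {"store": "A", "amount": 60.0},
    {"store": "A", "amount": 30.0},
]

combined_means = group_mean(X_train + X_test, "store", "amount")
train_means = group_mean(X_train, "store", "amount")

# Aggregate feature attached to every test row in store "A".
leaky_feature = combined_means["A"]   # influenced by both test rows
clean_feature = train_means["A"]      # depends on training rows only
```

In this toy case the combined mean for store A moves whenever another test row is added, while the train-only mean stays fixed, which is the effect described above.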

Mohit Sharma

2 Answers

1

You can try using cutoff times, which specify the last point in time at which data can be used to calculate features for an observation. The labels can be passed along with the cutoff times so that they stay aligned with the feature matrix. Then you can split the feature matrix into X_train and X_test.

With new data, the normalization should be repeatable so that the entity set has the same structure. Then you can calculate features with cutoff times as usual. You may also want to look into Compose, which automatically generates the cutoff times based on how you define the prediction problem. If cutoff times don't work in your use case, I will need more details to understand how each observation would affect the others. Let me know if this helps.
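As a rough illustration of the idea behind cutoff times, here is a plain-Python sketch rather than Featuretools code. It assumes each observation carries a timestamp, which the question does not state; the `time`, `store`, and `amount` columns and the helper `mean_before_cutoff` are hypothetical. Each row's feature is computed only from rows observed at or before that row's own cutoff:

```python
from datetime import datetime


def mean_before_cutoff(rows, group, cutoff):
    """Mean `amount` for `group`, using only rows timestamped at or before `cutoff`."""
    values = [
        row["amount"]
        for row in rows
        if row["store"] == group and row["time"] <= cutoff
    ]
    return sum(values) / len(values) if values else None


# Hypothetical toy data with one timestamp per observation.
rows = [
    {"id": 0, "store": "A", "amount": 10.0, "time": datetime(2020, 1, 1)},
    {"id": 1, "store": "A", "amount": 20.0, "time": datetime(2020, 1, 2)},
    {"id": 2, "store": "A", "amount": 60.0, "time": datetime(2020, 1, 3)},
    {"id": 3, "store": "A", "amount": 30.0, "time": datetime(2020, 1, 4)},
]

# One cutoff per observation: here, its own timestamp.
features = {
    row["id"]: mean_before_cutoff(rows, row["store"], row["time"])
    for row in rows
}
```

Later rows never influence the feature of an earlier row, so a time-based train/test split of the resulting matrix does not leak future observations into earlier ones. In Featuretools, the corresponding mechanism is the `cutoff_time` argument of `ft.dfs` and `ft.calculate_feature_matrix`, which relies on the entity having a time index.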

Jeff Hernandez
1

This is possible with calculate_feature_matrix() in Featuretools. A detailed guide is available in the documentation: https://docs.featuretools.com/en/stable/guides/deployment.html#calculating-feature-matrix-for-new-data

Suppose the new data is X_test. If it is a dataframe, you should first create an entity set for it.

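es_test = ft.EntitySet(id = 'obs_data')  # a separate entity set, so the training one is left untouched
es_test = es_test.entity_from_dataframe(entity_id = 'obs', dataframe = X_test, make_index = True, index = "Id")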

Otherwise, if it is already an entity set, you can skip the previous step. Suppose your test entity set is es_test and the feature definitions returned by ft.dfs on the training data are feature_names. Using those definitions from the training data, you can create a new feature matrix for the test data.

test_feat_generated = ft.calculate_feature_matrix(feature_names, es_test)

To reuse feature_names later, look at the save_features() and load_features() functions.

Note: The train and test entities should have the same entity_id; otherwise you will get an error.
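The pattern behind this answer (define the features once, store the definitions, and reapply them to new data) can be sketched with a toy plain-Python analogue. This is not how Featuretools implements it; the `PRIMITIVES` registry, the `calculate_matrix` helper, and the columns `x` and `y` are all hypothetical:

```python
import json

# Hypothetical registry of row-wise primitives, keyed by name.
PRIMITIVES = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def calculate_matrix(definitions, rows):
    """Apply each (primitive, col_a, col_b) definition to every row."""
    matrix = []
    for row in rows:
        out = {}
        for prim, col_a, col_b in definitions:
            out[f"{prim}({col_a}, {col_b})"] = PRIMITIVES[prim](row[col_a], row[col_b])
        matrix.append(out)
    return matrix


# Definitions chosen while working with the training data, then serialized.
definitions = [("add", "x", "y"), ("multiply", "x", "y")]
saved = json.dumps(definitions)

# Later: load the stored definitions and apply them to new data.
X_test = [{"x": 2, "y": 3}, {"x": 4, "y": 5}]
loaded = [tuple(d) for d in json.loads(saved)]
test_matrix = calculate_matrix(loaded, X_test)
```

Because the test matrix is produced from the stored definitions, it has the same feature columns as the training matrix. The sketch only covers row-wise features; aggregate features are still computed from whatever rows are in the entity set they are applied to.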

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – THess Feb 12 '20 at 08:28