
I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). The following code runs without problems:

from h2o.targetencoder import TargetEncoder

# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()

# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)

# find all categorical features
cat_features = [k for (k, v) in input_df_h2o.types.items() if v == 'string']
# convert string to factor
for i in cat_features:
    input_df_h2o[i] = input_df_h2o[i].asfactor()

# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)

But when I run the transform code on the same data set that I used to fit the target encoder (see the code below):

ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
                                    holdout_type="kfold", # the mean is calculated on out-of-fold data only; "loo" means leave-one-out
                                    is_train_or_valid=True,
                                    noise = 0, # determines if random noise should be added to the target average
                                    seed=54321)

I get an error like this:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
    assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError

I found the assertion in the source code (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html), but how do I fix this issue? It is the same table that I used to run the fit.

Gavin

2 Answers


The issue arises because you are trying to encode multiple categorical features at once. I think this is a bug in H2O, but you can work around it by putting the encoder in a for loop that iterates over all categorical column names.

import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()

df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['c'] * 9 + ['d'] * 1,
    'x_2': ['a'] * 3 + ['b'] * 7,
    'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})

hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']

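# fit and apply a separate encoder for each categorical column
for item in cat_features:
    target_encoder = TargetEncoder(x=[item], y='y_0', fold_column='cv_fold_te')
    target_encoder.fit(hf)
    # each pass returns the frame with the encoded column for `item` added
    hf = target_encoder.transform(frame=hf, holdout_type='kfold',
                                  seed=54321, noise=0.0)
hf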
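For readers unfamiliar with the "kfold" holdout type, the out-of-fold averaging it refers to can be sketched in plain Python. This is only an illustration of the idea, not H2O's implementation: the function name `kfold_target_encode` is hypothetical, and falling back to the global target mean when a category has no rows in the other folds is an assumption made for this sketch.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, folds):
    """Encode each row with the mean target of the rows that share its
    category but belong to a different fold (hypothetical helper)."""
    sums = defaultdict(float)   # (category, fold) -> sum of targets
    counts = defaultdict(int)   # (category, fold) -> number of rows
    for c, t, f in zip(categories, targets, folds):
        sums[(c, f)] += t
        counts[(c, f)] += 1

    fold_ids = set(folds)
    # Assumption for this sketch: use the global mean when no
    # out-of-fold rows exist for a category.
    prior = sum(targets) / len(targets)

    encoded = []
    for c, f in zip(categories, folds):
        num = sum(sums[(c, g)] for g in fold_ids if g != f)
        den = sum(counts[(c, g)] for g in fold_ids if g != f)
        encoded.append(num / den if den else prior)
    return encoded

example = kfold_target_encode(
    categories=['a', 'a', 'a', 'b', 'b'],
    targets=[1, 0, 1, 1, 0],
    folds=[0, 1, 0, 1, 0],
)
```

Because each row only sees targets from the other folds, a row's own label never leaks into its encoded value.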
joefaver
  • But the example from https://github.com/h2oai/h2o-tutorials/blob/78c3766741e8cbbbd8db04d54b1e34f678b85310/best-practices/feature-engineering/feature_engineering.ipynb uses two categorical features: "targetEncoder = TargetEncoder(x= ["addr_state", "purpose"], y = "bad_loan", fold_column = "cv_fold_te")" – Gavin Apr 13 '19 at 03:32
  • What is interesting is that when I test your example with 2 features in TargetEncoder, it works without any issue, but with 3 features I get the error message I listed in the question above. – Gavin Apr 13 '19 at 03:39
  • I followed the same demo and got the same error. I could solve it just by trying different features and different numbers of features. Make sure you do not have missing data before encoding. I had a pandas dataframe and used the imputer shared by @sveitser in [https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn] because I had a small percentage of missing values. After that, I could encode fourteen categorical variables with a for loop. It should be noted that the target encoder in H2O.ai is still in alpha. P.S.: I am using h2o 3.22.1.2 – joefaver Apr 14 '19 at 11:43
  • @joefaver Could you please share more details about the issue that you experienced with missing values? It is supposed to work fine with missing data, so let's see if it is a bug. – Deil May 13 '19 at 10:50

Thanks, everyone, for letting us know. The assertion was a precaution, as I was not sure whether there could be a case where the order of the columns changes. The rest of the code was written with this assumption in mind and is therefore safe to use with a changed order anyway, but the assertion was left in and forgotten. I added a test and removed the assertion. The issue is now fixed and merged, and the fix should be available in the upcoming fix release. 0xdata.atlassian.net/browse/PUBDEV-6474
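To illustrate why an order-sensitive assertion like the one in the traceback can fail even though the same columns are present, here is a minimal sketch. The two lists and both helper functions are hypothetical and only mimic the situation; they are not taken from H2O's code, and the actual fix described above was to remove the assertion, not to replace it.

```python
# Hypothetical data: the columns passed to the encoder, and the same
# column names coming back from the backend in a different order.
fitted_columns = ['x_0', 'x_1', 'x_2']
returned_keys = ['x_1', 'x_0', 'x_2']

def same_order(a, b):
    # List equality is order-sensitive, like the failing assertion.
    return list(a) == list(b)

def same_members(a, b):
    # Comparing sorted copies ignores the order of the names.
    return sorted(a) == sorted(b)

order_matches = same_order(fitted_columns, returned_keys)
members_match = same_members(fitted_columns, returned_keys)
```

Here `order_matches` is `False` while `members_match` is `True`, which is consistent with the error appearing on the very frame that was used for fitting.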

Deil