
I am working on a project where I had to apply target encoding to 3 categorical variables:

merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform('mean')
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform('mean')
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform('mean')

I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).

How can I transfer my encoding from the training sample to the test data? I would greatly appreciate any help.

P.S. I hope it makes sense.

  • It's not clear what your transformation is doing without seeing a sample of your input data and output. Generally, if you're putting things through models, it makes sense to use a transformer from the sklearn ecosystem that has `fit` and `transform` methods, or else to define your own function or class that can save the state and parameters of your transformation – G. Anderson Jun 28 '21 at 23:26
  • Ah, it is a story about West Nile pandemic in Chicago. Species are types of mosquitos (total 7), block is the location of the trap, and the trap itself. WnvPresent tells if the mosquito in that trap is infected with a virus or not (0/1). Makes more sense now? – Pythonista0801 Jun 28 '21 at 23:50
  • See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and creating a [mcve]. Just remember when you post here, you've been staring at your code and data for hours, this is the first time we're seeing it. The more specific and detailed you can be the better we'll be able to help – G. Anderson Jun 29 '21 at 15:20
  • @G.Anderson that's very helpful! I am still new and learning how to post it right! Thank you so much! – Pythonista0801 Jun 29 '21 at 20:03

2 Answers


There are 2 open source Python libraries that offer this functionality off-the-shelf: Feature-engine and Category encoders.

Assuming that we have a train and a testing set...

With Feature-engine it would work as follows:

from feature_engine.encoding import MeanEncoder

# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])

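# fit the encoder - finds the mean target value per category
# (assumes X_train holds only the predictors, like X_test, and
# y_train is the WnvPresent column of the training data)
encoder.fit(X_train, y_train)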

# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

We find the replacement values in the encoder_dict_ attribute as follows:

encoder.encoder_dict_

With Category encoders it works as follows:

from category_encoders.target_encoder import TargetEncoder

# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])

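# fit the encoder - finds the mean target value per category
# (assumes X_train holds only the predictors, like X_test, and
# y_train is the WnvPresent column of the training data)
encoder.fit(X_train, y_train)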

# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

The replacement values can be found in the attribute mapping:

encoder.mapping

More details are in the respective documentation of each library.

Category encoders' TargetEncoder also offers smoothing as suggested by @andrey-lukyanenko out-of-the-box.

Sole Galli

You need to save the mapping between each category and its mean target value if you want to apply it to the test dataset.

Here is a possible solution:

# df is the training data (merged_data in the question)
# Learn the category -> mean target mappings on the training data only
species_encoding = df.groupby('Species')['WnvPresent'].mean().to_dict()
block_encoding = df.groupby('Block')['WnvPresent'].mean().to_dict()
trap_encoding = df.groupby('Trap')['WnvPresent'].mean().to_dict()

# Apply the mappings to the training data
df['SpeciesEncoded'] = df['Species'].map(species_encoding)
df['BlockEncoded'] = df['Block'].map(block_encoding)
df['TrapEncoded'] = df['Trap'].map(trap_encoding)

# Reuse the same mappings on the test data
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding)
test_data['BlockEncoded'] = test_data['Block'].map(block_encoding)
test_data['TrapEncoded'] = test_data['Trap'].map(trap_encoding)
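
One caveat with `map`: a category that appears in the test data but not in the training data has no entry in the mapping and comes out as NaN. Below is a minimal sketch of one way to handle that, assuming the overall training mean is an acceptable fallback. The toy frames and the `encode_with_fallback` helper are hypothetical and only there for illustration.

```python
import pandas as pd

# Toy stand-ins for the real training and test frames (made-up values).
train = pd.DataFrame({
    "Species": ["A", "A", "B", "B", "B"],
    "WnvPresent": [1, 0, 0, 0, 1],
})
test = pd.DataFrame({"Species": ["A", "B", "C"]})  # "C" never appears in training


def encode_with_fallback(train_df, test_df, col, target):
    """Map per-category training means onto test_df[col].

    Categories unseen in training get the global training mean (an assumption).
    """
    mapping = train_df.groupby(col)[target].mean()
    global_mean = train_df[target].mean()
    return test_df[col].map(mapping).fillna(global_mean)


test["SpeciesEncoded"] = encode_with_fallback(train, test, "Species", "WnvPresent")
```

Whether the global mean is the right fallback depends on the model; some people prefer to leave the NaN and let the model or an imputer deal with it.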

This would answer your question, but I want to add that this approach can be improved. Using the raw mean of the target directly can make the model overfit, especially for categories with only a few rows.

There are many ways to improve target encoding. One of them is smoothing; here is a link to an explanation: https://maxhalford.github.io/blog/target-encoding/

Here is an example:

m = 10
mean = df['WnvPresent'].mean()

# Compute the number of values and the mean of each group
agg = df.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']

# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
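
The same calculation can be wrapped in a small helper, so that the smoothed means are learned once on the training data and then reused on the test data. This is only a sketch: the `smoothed_encoding` name, the toy frames, and the value of `m` are assumptions for illustration.

```python
import pandas as pd


def smoothed_encoding(train_df, col, target, m=10):
    """Return a Series mapping each category to its smoothed target mean.

    Each category mean is pulled towards the global mean with weight m.
    """
    prior = train_df[target].mean()
    agg = train_df.groupby(col)[target].agg(["count", "mean"])
    return (agg["count"] * agg["mean"] + m * prior) / (agg["count"] + m)


# Toy stand-ins for the real training and test frames (made-up values).
train = pd.DataFrame({
    "Species": ["A", "A", "B", "B", "B", "B"],
    "WnvPresent": [1, 1, 0, 0, 0, 1],
})
test = pd.DataFrame({"Species": ["B", "A"]})

# Learn on the training data, then apply to both frames.
species_encoding = smoothed_encoding(train, "Species", "WnvPresent", m=2)
train["SpeciesEncoded"] = train["Species"].map(species_encoding)
test["SpeciesEncoded"] = test["Species"].map(species_encoding)
```

With these toy values the raw means are 1.0 for "A" and 0.25 for "B", and the smoothed values move towards the global mean of 0.5. The smaller group ("A") moves further, which is the point of the smoothing.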
Andrey Lukyanenko