I am building a custom transformer that preprocesses data in three steps. First, it applies a set of functions I wrote that take existing features and engineer new ones. Next, the categorical variables are one-hot encoded. Finally, it drops the columns of the DataFrame that are no longer needed.
The dataset I'm using is the Kaggle House Prices dataset.
The problem is ensuring that the dummy variables for the categorical features in the test set match those in the training set. Some categories of a feature that appear in the training set might not appear in the test set, so the test set won't have a dummy variable for that category. I've done research and ran into this solution, and I'm trying to implement the first answer in my custom transformer class. First, I'm not sure this is the best way to do it. Second, I'm getting an error, described below.
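To make the mismatch concrete, here is a minimal sketch on made-up data. The frames, the `Street` values, and the variable names are illustrative assumptions, not part of my actual pipeline. It shows a category that exists only in the training set, and one way to align the test columns afterwards. I use `reindex` here as a compact stand-in for the add-missing-columns-then-select steps in my transformer.

```python
import pandas as pd

# Hypothetical toy data: the category 'Pave' never appears in the test set.
train = pd.DataFrame({"Street": ["Grvl", "Pave", "Grvl"],
                      "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"Street": ["Grvl", "Grvl"],
                     "LotArea": [9550, 14260]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Dummy columns produced for the training set but absent from the test set
missing_cols = set(train_dummies.columns) - set(test_dummies.columns)

# Align the test set to the training columns: missing columns are added
# and filled with 0, extra columns are dropped, and the order is matched.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(missing_cols)
print(list(test_aligned.columns))
```

With this toy data the test set lacks the `Street_Pave` column until it is aligned, which is the situation I am trying to handle inside the transformer.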
I've included the full list of the functions I apply to the data but only show a couple of the actual functions below.

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class HouseFeatureTransformer(BaseEstimator, TransformerMixin):
        def __init__(self, funcs, func_cols, drop_cols, drop_first=True):
            self.funcs = funcs
            self.func_cols = func_cols
            self.train_cols = None
            self.drop_cols = drop_cols
            self.drop_first = drop_first

        def fit(self, X, y=None):
            X_trans = self.apply_funcs(X)
            X_trans.drop(columns=self.drop_cols, inplace=True)
            # save training columns to compare to columns of any later seen dataset
            self.train_cols = X_trans.columns
            return self

        def transform(self, X, y=None):
            X_test = self.apply_funcs(X)
            X_test.drop(columns=self.drop_cols, inplace=True)
            test_cols = X_test.columns
            # ensure that all columns in the training set are present in the test set
            # set should be empty for first fit_transform
            missing_cols = set(self.train_cols) - set(test_cols)
            for col in missing_cols:
                X_test[col] = 0
            # reduce columns in test set to only what was in the training set
            X_test = X_test[self.train_cols]
            return X_test.values

        def apply_funcs(self, X):
            # apply each function to its respective column
            for func, func_col in zip(self.funcs, self.func_cols):
                X[func_col] = X.apply(func, axis=1)
            # one-hot encode categorical variables
            X = pd.get_dummies(X, drop_first=self.drop_first)
            return X

    # functions to apply
    funcs = [sold_age, yrs_remod, lot_shape, land_slope, rfmat, bsmt_bath, baths,
             other_rooms, fence_qual, newer_garage]

    # feature names
    func_cols = ['sold_age', 'yr_since_remod', 'LotShape', 'LandSlope', 'RoofMatl', 'BsmtBaths', 'Baths',
                 'OtherRmsAbvGr', 'Fence', 'newer_garage']

    # features to drop
    to_drop = ['Alley', 'Utilities', 'Condition2', 'HouseStyle', 'LowQualFinSF', 'EnclosedPorch',
               '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'MiscFeature', 'MiscVal',
               'YearBuilt', 'YrSold', 'YearRemodAdd', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
               'TotRmsAbvGrd', 'GarageYrBlt', '1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'ExterQual',
               'ExterCond', 'BsmtQual', 'BsmtCond', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'BsmtFinType2',
               'Exterior1st', 'Exterior2nd', 'GarageCars', 'Functional', 'SaleType', 'SaleCondition']

    # functions to transform data
    def sold_age(row):
        '''calculates the age of the house when it was sold'''
        return row['YrSold'] - row['YearBuilt']

    def yrs_remod(row):
        '''calculates the years since house was remodeled'''
        yr_blt = row['YearBuilt']
        yr_remodeled = row['YearRemodAdd']
        yr_sold = row['YrSold']
        if yr_blt == yr_remodeled:
            return 0
        else:
            return yr_sold - yr_remodeled

    def lot_shape(row):
        '''consolidates all irregular categories into one'''
        if row['LotShape'] == 'Reg':
            return 'Reg'
        else:
            return 'Irreg'

During the fit, I apply the functions, dummy the categoricals, drop the unneeded columns, and then save the resulting columns to self.train_cols. During the transform, I do the same steps, except that I save the transformed columns to test_cols. I then compare these columns to the ones obtained in the fit and add any columns that were in the training set but are missing from the test set, as shown in the answer I linked. The error I get is below:
KeyError: "['Alley' 'Utilities' 'Condition2' 'HouseStyle' 'PoolQC' 'MiscFeature'\n 'ExterQual' 'ExterCond' 'BsmtQual' 'BsmtCond' 'KitchenQual' 'FireplaceQu'\n 'GarageQual' 'GarageCond' 'BsmtFinType2' 'Exterior1st' 'Exterior2nd'\n 'Functional' 'SaleType' 'SaleCondition'] not found in axis"
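An error of the same shape seems reproducible on a toy frame, sketched below. The data and variable names are made up for illustration, and this is not a confirmed diagnosis of my pipeline. The sketch encodes a frame with `pd.get_dummies` and then calls `drop` with the original column names, mirroring the order of operations in `fit`.

```python
import pandas as pd

# Hypothetical toy frame: 'Alley' is categorical, 'PoolArea' is numeric.
df = pd.DataFrame({"Alley": ["Grvl", "Pave"], "PoolArea": [0, 512]})

# Encoding replaces 'Alley' with one dummy column per category;
# the numeric 'PoolArea' column keeps its original name.
encoded = pd.get_dummies(df)

try:
    # Same order as in fit(): drop by original name after encoding.
    encoded.drop(columns=["Alley", "PoolArea"])
    error_message = None
except KeyError as exc:
    error_message = str(exc)

print(list(encoded.columns))
print(error_message)
```

In this sketch only the categorical name is reported as not found, which matches the pattern in my traceback: every column listed there is a categorical feature from to_drop, while the numeric ones are absent from the message.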
I'm trying to understand why I'm getting this error and whether there's a better way to implement this process.