1

Background: I am trying to learn from a notebook used in Kaggle House Price Prediction Dataset.

I am trying to use a Pipeline to transform numerical and categorical columns in a dataframe. It is having issues with my Categorical variables' names, which is a list stored in this variable categ_cols_names. It says that those categorical columns are not unique in dataframe, which I'm not sure what that means.

categ_cols_names = ['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','Functional','Fireplaces','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageQual','GarageCond','PavedDrive','MoSold','YrSold','SaleType','SaleCondition','OverallQual','GarageCars','FullBath','YearBuilt']

Below is my code:

# Get numerical columns names
num_cols_names = X_train.columns[X_train.dtypes != object].to_list()

# Numerical columns with missing values
num_nan_cols = X_train[num_cols_names].columns[X_train[num_cols_names].isna().sum() > 0]

# Assign np.nan type to NaN values in categorical features
# in order to ensure detectability in posterior methods
X_train[num_nan_cols] = X_train[num_nan_cols].fillna(value = np.nan, axis = 1)

# Define pipeline for imputation of the numerical features
num_pipeline = Pipeline(steps = [
                                    ('Simple Imputer', SimpleImputer(strategy = 'median')),
                                    ('Robust Scaler', RobustScaler()),
                                    ('Power Transformer', PowerTransformer())
                                ]
                        )

# Get categorical columns names
categ_cols_names = X_train.columns[X_train.dtypes == object].to_list()

# Categorical columns with missing values
categ_nan_cols = X_train[categ_cols_names].columns[X_train[categ_cols_names].isna().sum() > 0]

# Assign np.nan type to NaN values in categorical features
# in order to ensure detectability in posterior methods
X_train[categ_nan_cols] = X_train[categ_nan_cols].fillna(value = np.nan, axis = 1)

# Define pipeline for imputation and encoding of the categorical features
categ_pipeline = Pipeline(steps = [
('Categorical Imputer', SimpleImputer(strategy = 'most_frequent')),
('One Hot Encoder', OneHotEncoder(drop = 'first'))
])

ct = ColumnTransformer([
('Categorical Pipeline', categ_pipeline, categ_cols_names),
('Numerical Pipeline', num_pipeline, num_cols_names)], 
                        remainder        = 'passthrough', 
                        sparse_threshold = 0,
                        n_jobs           = -1)

pipe = Pipeline(steps = [('Column Transformer', ct)])
pipe.fit_transform(X_train)

The ValueError occurs on the .fit_transform() line: enter image description here

Here is a sample of my X_train:

{'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
 'Street': {0: 'Pave', 1: 'Pave', 2: 'Pave', 3: 'Pave', 4: 'Pave'},
 'LotShape': {0: 'Reg', 1: 'Reg', 2: 'IR1', 3: 'IR1', 4: 'IR1'},
 'LandContour': {0: 'Lvl', 1: 'Lvl', 2: 'Lvl', 3: 'Lvl', 4: 'Lvl'},
 'Utilities': {0: 'AllPub',
  1: 'AllPub',
  2: 'AllPub',
  3: 'AllPub',
  4: 'AllPub'},
 'LotConfig': {0: 'Inside', 1: 'FR2', 2: 'Inside', 3: 'Corner', 4: 'FR2'},
 'LandSlope': {0: 'Gtl', 1: 'Gtl', 2: 'Gtl', 3: 'Gtl', 4: 'Gtl'},
 'Neighborhood': {0: 'CollgCr',
  1: 'Veenker',
  2: 'CollgCr',
  3: 'Crawfor',
  4: 'NoRidge'},
 'Condition1': {0: 'Norm', 1: 'Feedr', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
 'Condition2': {0: 'Norm', 1: 'Norm', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
 'BldgType': {0: '1Fam', 1: '1Fam', 2: '1Fam', 3: '1Fam', 4: '1Fam'},
 'HouseStyle': {0: '2Story',
  1: '1Story',
  2: '2Story',
  3: '2Story',
  4: '2Story'},
 'OverallQual': {0: '7', 1: '6', 2: '7', 3: '7', 4: '8'},
 'OverallCond': {0: '5', 1: '8', 2: '5', 3: '5', 4: '5'},
 'YearBuilt': {0: '2003', 1: '1976', 2: '2001', 3: '1915', 4: '2000'},
 'YearRemodAdd': {0: '2003', 1: '1976', 2: '2002', 3: '1970', 4: '2000'},
 'RoofStyle': {0: 'Gable', 1: 'Gable', 2: 'Gable', 3: 'Gable', 4: 'Gable'},
 'RoofMatl': {0: 'CompShg',
  1: 'CompShg',
  2: 'CompShg',
  3: 'CompShg',
  4: 'CompShg'},
 'Exterior1st': {0: 'VinylSd',
  1: 'MetalSd',
  2: 'VinylSd',
  3: 'Wd Sdng',
  4: 'VinylSd'},
 'Exterior2nd': {0: 'VinylSd',
  1: 'MetalSd',
  2: 'VinylSd',
  3: 'Wd Shng',
  4: 'VinylSd'},
 'MasVnrType': {0: 'BrkFace',
  1: 'None',
  2: 'BrkFace',
  3: 'None',
  4: 'BrkFace'},
 'ExterQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'TA', 4: 'Gd'},
 'ExterCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
 'Foundation': {0: 'PConc', 1: 'CBlock', 2: 'PConc', 3: 'BrkTil', 4: 'PConc'},
 'BsmtQual': {0: 'Gd', 1: 'Gd', 2: 'Gd', 3: 'TA', 4: 'Gd'},
 'BsmtCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'Gd', 4: 'TA'},
 'BsmtExposure': {0: 'No', 1: 'Gd', 2: 'Mn', 3: 'No', 4: 'Av'},
 'BsmtFinType1': {0: 'GLQ', 1: 'ALQ', 2: 'GLQ', 3: 'ALQ', 4: 'GLQ'},
 'BsmtFinType2': {0: 'Unf', 1: 'Unf', 2: 'Unf', 3: 'Unf', 4: 'Unf'},
 'Heating': {0: 'GasA', 1: 'GasA', 2: 'GasA', 3: 'GasA', 4: 'GasA'},
 'HeatingQC': {0: 'Ex', 1: 'Ex', 2: 'Ex', 3: 'Gd', 4: 'Ex'},
 'CentralAir': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
 'Electrical': {0: 'SBrkr', 1: 'SBrkr', 2: 'SBrkr', 3: 'SBrkr', 4: 'SBrkr'},
 'BsmtFullBath': {0: '1', 1: '0', 2: '1', 3: '1', 4: '1'},
 'BsmtHalfBath': {0: '0', 1: '1', 2: '0', 3: '0', 4: '0'},
 'FullBath': {0: '2', 1: '2', 2: '2', 3: '1', 4: '2'},
 'HalfBath': {0: '1', 1: '0', 2: '1', 3: '0', 4: '1'},
 'BedroomAbvGr': {0: '3', 1: '3', 2: '3', 3: '3', 4: '4'},
 'KitchenAbvGr': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'KitchenQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'Gd', 4: 'Gd'},
 'Functional': {0: 'Typ', 1: 'Typ', 2: 'Typ', 3: 'Typ', 4: 'Typ'},
 'Fireplaces': {0: '0', 1: '1', 2: '1', 3: '1', 4: '1'},
 'GarageType': {0: 'Attchd',
  1: 'Attchd',
  2: 'Attchd',
  3: 'Detchd',
  4: 'Attchd'},
 'GarageYrBlt': {0: '2003.0',
  1: '1976.0',
  2: '2001.0',
  3: '1998.0',
  4: '2000.0'},
 'GarageFinish': {0: 'RFn', 1: 'RFn', 2: 'RFn', 3: 'Unf', 4: 'RFn'},
 'GarageCars': {0: '2', 1: '2', 2: '2', 3: '3', 4: '3'},
 'GarageQual': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
 'GarageCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
 'PavedDrive': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
 'MoSold': {0: '2', 1: '5', 2: '9', 3: '2', 4: '12'},
 'YrSold': {0: '2008', 1: '2007', 2: '2008', 3: '2006', 4: '2008'},
 'SaleType': {0: 'WD', 1: 'WD', 2: 'WD', 3: 'WD', 4: 'WD'},
 'SaleCondition': {0: 'Normal',
  1: 'Normal',
  2: 'Normal',
  3: 'Abnorml',
  4: 'Normal'},
 'GrLivArea': {0: 1710, 1: 1262, 2: 1786, 3: 1717, 4: 2198},
 'GarageArea': {0: 548, 1: 460, 2: 608, 3: 642, 4: 836},
 'TotalBsmtSF': {0: 856, 1: 1262, 2: 920, 3: 756, 4: 1145},
 '1stFlrSF': {0: 856, 1: 1262, 2: 920, 3: 961, 4: 1145},
 'TotRmsAbvGrd': {0: 8, 1: 6, 2: 6, 3: 7, 4: 9}}
Katsu
  • 8,479
  • 3
  • 15
  • 16
  • I have come across the same issue recently. If you do `df.columns` the result is an `Index` or a `MultiIndex` object? – thanasissdr Mar 09 '23 at 21:44

1 Answers1

1

I recently ran into the same issue and was able to resolve it by removing duplicates from the categorical variables list. Your list also has duplicates in it, which can be verified programatically:

categ_cols_names = ['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','Functional','Fireplaces','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageQual','GarageCond','PavedDrive','MoSold','YrSold','SaleType','SaleCondition','OverallQual','GarageCars','FullBath','YearBuilt']

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

list_duplicates(categ_cols_names)

The duplicates are:

['FullBath', 'YearBuilt', 'GarageCars', 'OverallQual']

Try removing those from categ_cols_names and re-running your code. For what it's worth, I don't think the error message is very helpful, as it makes it difficult to debug that there could be a single duplicated column among a long list of them.

D. Ross
  • 133
  • 8