19

This is my first machine learning project and the first time I am using ColumnTransformer. My aim is to perform two data preprocessing steps and to use a ColumnTransformer for each of them.

In the first step, I want to replace the missing values in my dataframe with the string 'missing_value' for some features, and with the most frequent value for the remaining features. Therefore, I combine these two operations in a ColumnTransformer, passing it the corresponding columns of my dataframe.

In the second step, I want to take the freshly preprocessed data and apply OrdinalEncoder or OneHotEncoder depending on the feature. For that I again use a ColumnTransformer.

I then combine the two steps into a single pipeline.

I am using the Kaggle House Prices dataset, I have scikit-learn version 0.20, and this is a simplified version of my code:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                          # Street
               ['missing_value', 'Pave', 'Grvl'],         # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']  # PoolQC
]
cat_columns_onehot = ['MSZoning', 'LandContour']


imputer_cat_pipeline = ColumnTransformer([
        ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),  # fill_value='missing_value' by default
        ('imp_freq', SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
])

encoder_cat_pipeline = ColumnTransformer([
        ('ordinal', OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
        ('pass_ord', OneHotEncoder(), cat_columns_onehot),
])

cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('cat_encoder', encoder_cat_pipeline),
])

Unfortunately, when I apply it to housing_cat, the subset of my dataframe including only categorical features,

cat_pipeline.fit_transform(housing_cat)

I get the error:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

...

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

I have tried this simplified pipeline and it works properly:

new_cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('onehot', OneHotEncoder()),
])

However, if I try:

enc_one = ColumnTransformer([
        ('onehot', OneHotEncoder(), cat_columns_onehot),
        ('pass_ord', 'passthrough', cat_columns_ord)
])

new_cat_pipeline = Pipeline([
        ('imp_cat', imputer_cat_pipeline),
        ('onehot_encoder', enc_one),
])

I start to get the same error.

I therefore suspect that this error is related to the use of ColumnTransformer in the second step, but I do not actually understand where it comes from. The way I identify the columns in the second step is the same as in the first step, so it remains unclear to me why I get the AttributeError only in the second step...

Giulia

4 Answers

13

ColumnTransformer returns a numpy.ndarray, so it has no columns attribute (as your error indicates).
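
A minimal, self-contained illustration of this (the toy DataFrame below is invented purely for demonstration, it is not the questioner's data): the output of the first step is a plain array, which is why the second step can no longer select columns by name.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'PoolQC': [None, 'Gd'], 'Alley': ['Pave', None]})

step1 = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), ['PoolQC', 'Alley']),
])
out = step1.fit_transform(df)
print(type(out))  # <class 'numpy.ndarray'> -- the column names are gone,
                  # so a later ColumnTransformer cannot select columns by string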

If I may suggest a different solution: use pandas for both of your tasks; it will be easier.

Step 1 - replacing missing values

To replace missing values in a subset of columns with the string missing_value, use this:

dataframe[["PoolQC", "Alley"]].fillna("missing_value", inplace=True)

For the rest (imputing each of those categorical columns with its most frequent value), this will work:

dataframe[["Street", "MSZoning", "LandContour"]].fillna(
    dataframe[["Street", "MSZoning", "LandContour"]].mean(), inplace=True
)

Step 2 - one hot encoding and categorical variables

pandas provides get_dummies, which returns a pandas DataFrame, unlike ColumnTransformer; the code for this would be:

encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
dataframe.drop(columns=['MSZoning', 'LandContour'], inplace=True)
dataframe = dataframe.join(encoded)

For ordinal variables and their encoding, I would suggest you look at this SO answer (unfortunately, some manual mapping would be needed in this case).
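
For instance, a rough sketch of such a manual mapping in pandas, reusing the ordering from the question's ord_mapping (the mapping dictionaries below are my own illustration, not part of the linked answer):

street_map = {'Pave': 0, 'Grvl': 1}
alley_map = {'missing_value': 0, 'Pave': 1, 'Grvl': 2}
poolqc_map = {'missing_value': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}

# map each ordinal column to integer codes according to its own ordering
dataframe['Street'] = dataframe['Street'].map(street_map)
dataframe['Alley'] = dataframe['Alley'].map(alley_map)
dataframe['PoolQC'] = dataframe['PoolQC'].map(poolqc_map)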

If you want to use the transformer anyway

Get an np.array from the dataframe using the values attribute, pass it through the pipeline, and recreate the columns and indices from the array like this:

pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])

There is one caveat to this approach though: you will not know the names of the custom-created one-hot-encoded columns (the pipeline will not do this for you).

Additionally, you could get the names of the columns from sklearn's transformer objects (e.g. using the categories_ attribute), but I think it would break the pipeline (someone correct me if I'm wrong).
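
For example, a small standalone sketch of reading the learned categories back from a fitted OneHotEncoder (the toy frame here is invented for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL'], 'LandContour': ['Lvl', 'Bnk', 'Lvl']})
ohe = OneHotEncoder()
ohe.fit(toy)
print(ohe.categories_)  # one array of learned levels per input column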

Szymon Maszke
  • Hi, thank you for your fast reply. What you suggest is indeed what I did before: using fillna for the missing values and get_dummies/replace for the nominal/ordinal variables. However, I thought that it would have been nicer to create a pipeline, such that all the data preprocessing is not manual and can be easily performed again on a different dataset with the same variables. On the other hand, it's true that not having access to the one-hot-encoded columns is annoying... – Giulia Jan 22 '19 at 15:10
  • I agree it would be better, though I don't see how/don't know whether it could be done in such a succinct way. – Szymon Maszke Jan 22 '19 at 15:21
  • Ok, so if I understand correctly, the problem is not in `ColumnTransformer` per se, but in the fact that the transformer `SimpleImputer` gives `np.array` as output. Thus in the next encoding step I cannot indicate the columns by their names. Is that correct? I must use the corresponding indices instead, right? – Giulia Jan 22 '19 at 17:49
  • but using options 1 and 2 I lose the possibility of persisting my model pipeline... – ricoms Jun 15 '21 at 20:12
5

Option #2

use the make_pipeline function

(I had the same error, found this answer, and then found this: Introducing the ColumnTransformer.)

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                          # Street
               ['missing_value', 'Pave', 'Grvl'],         # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']  # PoolQC
               ]
cat_columns_onehot = ['MSZoning', 'LandContour']

imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
    (make_pipeline(SimpleImputer(strategy='most_frequent')), cat_columns_fill_freq),
)

encoder_cat_pipeline = make_column_transformer(
    (OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    (OneHotEncoder(), cat_columns_onehot),
)

cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('cat_encoder', encoder_cat_pipeline),
])

In my own pipelines I do not have overlapping preprocessing in the column space, so I am not sure how the transformation and then the "outer pipelining" work.

However, the important part is to use make_pipeline around the SimpleImputer to use it in a pipeline properly:

imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant')), cat_columns_fill_miss),
)
Jonas
4

Just to add to the other answers here: I'm no Python or data science expert, but you can pass another pipeline to ColumnTransformer in order to do what you need and apply more than one transformer to a column. I came here looking for an answer to the same question and found this solution.

Doing it all via pipelines makes it much easier to control the train/test data and avoid leakage, and it opens up more grid search possibilities too. For these reasons I'm personally not a fan of the pandas approach in another answer, but it would still work OK.

encoder_cat_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder(categories=ord_mapping)),
    ('pass_ord', OneHotEncoder()),
])

imputer_cat_pipeline = ColumnTransformer([
    ('imp_miss', SimpleImputer(strategy='constant'), cat_columns_fill_miss),
    ('new_pipeline', encoder_cat_pipeline, cat_columns_fill_freq)
])

cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
])
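
To make the idea of "one nested pipeline per column group" concrete, here is a hedged sketch of my own (not part of the original answer) that reuses the column lists and ord_mapping from the question; the names ordinal_pipe, onehot_pipe and full_cat_pipeline are just for illustration, and the ordinal group is simply imputed with the constant here.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# each column group gets its own imputer followed by its own encoder
ordinal_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant')),       # fills NaN with 'missing_value'
    ('encode', OrdinalEncoder(categories=ord_mapping)),
])
onehot_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder()),
])

full_cat_pipeline = ColumnTransformer([
    ('ordinal', ordinal_pipe, cat_columns_ord),
    ('onehot', onehot_pipe, cat_columns_onehot),
])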
john
  • This will not work as expected: the original categorical column would come through in addition to its one-hot-encoded columns. – ggaurav Jan 04 '21 at 07:09
3

Whenever I am doing any transformations, I like to use the FunctionTransformer sklearn offers instead of doing transformations directly in pandas. The reason is that my feature transformations then generalize to new incoming data (e.g. suppose you win, and you need to use the same code to predict on next year's data). This way you won't have to re-run your code; you can save your preprocessor and call transform. I use something like this:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

FE_pipeline = {
    'numeric_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.replace([np.inf, -np.inf], np.nan)),
        MinMaxScaler(),
        SimpleImputer(strategy='median', add_indicator=True),
    ),
    'oh_pipe': make_pipeline(
        FunctionTransformer(lambda x: x.astype(str)),
        SimpleImputer(strategy='constant'),
        OneHotEncoder(handle_unknown='ignore'),
    ),
}
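
A hedged usage sketch (the column lists and the frame names housing_train and housing_next_year below are my own assumptions, not part of the answer): plug the two pipelines into a ColumnTransformer, fit it once on the training data, and then just call transform on the new data.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('numeric', FE_pipeline['numeric_pipe'], ['LotArea', 'GrLivArea']),      # hypothetical numeric columns
    ('categorical', FE_pipeline['oh_pipe'], ['MSZoning', 'LandContour']),    # hypothetical categorical columns
])

X_train_prepared = preprocessor.fit_transform(housing_train)   # fit once on training data
X_next_year = preprocessor.transform(housing_next_year)        # reuse on new data, no refitting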
Matt Elgazar