
I am getting a SettingWithCopyWarning from Pandas when performing the below operation. I understand what the warning means and I know I can turn the warning off but I am curious if I am performing this type of standardization incorrectly using a pandas dataframe (I have mixed data with categorical and numeric columns). My numbers seem fine after checking but I would like to clean up my syntax to make sure I am using Pandas correctly.

I am curious if there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.

My process is as follows with some toy data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List

# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
                                ['1',125,'A',15],
                                ['2',134,'A',20],
                                ['3',112,'A',25],
                                ['4',107,'B',35],
                                ['5',68,'B',50],
                                ['6',321,'B',10],
                                ['7',26,'B',27],
                                ['8',115,'C',64],
                                ['9',100,'C',72],
                                ['10',74,'C',18],
                                ['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
id        object
weight     int64
type      object
age        int64
dtype: object

# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()

# prepare data for modeling by splitting into train and test
# use only standardization means/standard deviations from the TRAINING SET only 
# and apply them to the testing set as to avoid information leakage from training set into testing set
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])


<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Coldchain9

2 Answers


Your X_train and X_test are still slices of the original dataframe. Modifying a slice triggers the warning, and the assignment often doesn't take effect.

You can either transform before train_test_split, or do X_train = X_train.copy() after the split and then transform.

The second approach prevents the information leakage noted in your comments. So something like this:

# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy()    # don't you drop the label?
# y: pd.Series = df.pop('type')  # y = df['type']

# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'], 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()

## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = X_train_scaler.transform(X_test[numeric_cols])
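Putting the pieces together, here is a minimal self-contained sketch of the copy-then-transform workflow (the toy data is recreated from the question; after scaling, the training columns should have mean 0 and population standard deviation 1):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy data matching the question
df = pd.DataFrame({'id': [str(i) for i in range(12)],
                   'weight': [100, 125, 134, 112, 107, 68, 321, 26, 115, 100, 74, 63],
                   'type': list('AAAABBBBCCCC'),
                   'age': [10, 15, 20, 25, 35, 50, 10, 27, 64, 72, 18, 18]})

numeric_cols = ['weight', 'age']
features = [c for c in df.columns if c != 'type']

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['type'], test_size=0.2, random_state=0)

# explicit copies, so the assignments below modify independent frames
X_train = X_train.copy()
X_test = X_test.copy()

# fit on the training set only, then apply to both splits
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Because the scaler is fit on X_train only, X_test is standardized with the training set's means and standard deviations, which is exactly the no-leakage behavior the question asks for.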
Quang Hoang
  • Seems intuitive, thanks. Is this a typical workflow for preparing data for modeling? I can't imagine it's super uncommon that people have mixed data types (categorical and numeric)? – Coldchain9 Jan 22 '21 at 01:37
  • Yes, it's typical, at least to me :-). – Quang Hoang Jan 22 '21 at 01:39
  • Now if I wanted to add a ```pd.get_dummies()``` step into this workflow, where would you put it? I assume something like ```df[features] = pd.get_dummies(df[features], columns=cat_cols)``` – Coldchain9 Jan 22 '21 at 01:40

I'll explain both pd.get_dummies() and OneHotEncoder() for transforming categorical data into dummy columns, but I recommend OneHotEncoder(), because it's a sklearn transformer that you can use in a Pipeline later if you want.

First, OneHotEncoder(): it does the same job as pandas' pd.get_dummies function, but returns a NumPy ndarray or a sparse array. You can read more about this class here:

from sklearn.preprocessing import OneHotEncoder

X_train_cat = X_train[["type"]]
cat_encoder = OneHotEncoder(sparse=False)
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat) #This is a numpy ndarray!
#If you want to make a DataFrame again, you can do so like below:
#X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
#You can also concatenate this transformed dataframe with your numerical transformed one.
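A hedged sketch of that concatenation step (the frame and column names here are illustrative, not from the question; the try/except handles the rename of the sparse argument to sparse_output in newer sklearn versions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# illustrative training frame mirroring the question's toy data
X_train = pd.DataFrame({'weight': [100, 125, 68, 115],
                        'type':   ['A', 'A', 'B', 'C'],
                        'age':    [10, 15, 50, 64]})

try:
    cat_encoder = OneHotEncoder(sparse_output=False)  # sklearn >= 1.2
except TypeError:
    cat_encoder = OneHotEncoder(sparse=False)         # older sklearn

onehot = cat_encoder.fit_transform(X_train[['type']])

# rebuild a DataFrame, keeping the original index so concat aligns rows
onehot_df = pd.DataFrame(onehot,
                         columns=cat_encoder.categories_[0],
                         index=X_train.index)

# concatenate the dummy columns with the remaining (numeric) columns
X_train_encoded = pd.concat([X_train.drop(columns='type'), onehot_df], axis=1)
```

Preserving the index when rebuilding the DataFrame matters: after train_test_split the row labels are shuffled, and pd.concat aligns on the index, not on position.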

Second method, pd.get_dummies():

df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)
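One caveat when using pd.get_dummies separately on train and test splits (a sketch with illustrative data, not from the answer above): categories present in training may be absent from the test set, so the test frame's dummy columns should be realigned to the training columns:

```python
import pandas as pd

# illustrative splits: the test set never sees categories 'A' or 'C'
X_train = pd.DataFrame({'type': ['A', 'A', 'B', 'C'], 'age': [10, 15, 50, 64]})
X_test = pd.DataFrame({'type': ['B', 'B'], 'age': [27, 18]})

train_d = pd.get_dummies(X_train, columns=['type'])
test_d = pd.get_dummies(X_test, columns=['type'])

# align test columns to the training columns, filling missing dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
```

Without the reindex, the model would see different feature columns at train and test time. OneHotEncoder avoids this by remembering the categories it was fit on.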
ashkangh