Apply StandardScaler to parts of a data set

Question

I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?

For instance, say my data is:

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

I fit and transform the data

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657

But of course the names are not really integers but strings, and I don't want to standardize them. How can I apply the fit and transform methods only to the Age and Weight columns?

I would like to suggest a better solution: the accepted answer does not preserve column names and is therefore poor. Instead, this one-liner should be used: `data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']])` – Philipp Schwarz, Apr 12 '22 at 17:00

Guy C · Accepted Answer · 2019-01-23T09:49:00.537

61

ColumnTransformer, introduced in scikit-learn v0.20, applies transformers to a specified set of columns of an array or pandas DataFrame.

import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
        ('somename', StandardScaler(), ['Age', 'Weight'])
    ], remainder='passthrough')

ct.fit_transform(features)

NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

array([[-1.41100443,  1.20270298,  3.        ],
       [ 0.62304092,  0.04295368,  4.        ],
       [ 0.78796352, -1.24565666,  6.        ]])
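The shorthand mentioned in the NB could look roughly like the sketch below. It assumes a reasonably recent scikit-learn, where each tuple passed to make_column_transformer is written as (transformer, columns); the variable name `result` is just for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# No transformer names needed: each tuple is (transformer, columns)
ct = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough',  # keep Name unchanged
)

# The result is a NumPy array: scaled Age and Weight first, then Name
result = ct.fit_transform(data)
print(result)
```

As with the named version, the passthrough column ends up last and the column labels are not part of the returned array.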

edited Jan 23 '19 at 09:49

answered Jan 23 '19 at 08:21

Guy C



This is now the best answer (doesn't require you to copy a data frame) – kellyfj Feb 05 '19 at 17:18

Nice answer! How could I preserve the column names if I did this with a pandas dataframe? Is there a way without having to rename all columns at the end? – DataBach Apr 23 '20 at 13:37
This is what I was looking for, the best and fastest answer, although using apply is also an alternative. – user3065757 Jul 06 '20 at 09:14

The accepted answer does not preserve column names and is therefore poor. Instead use this one-liner: `data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']])` – Philipp Schwarz Apr 12 '22 at 16:55
Either column names **or** column order needs to be preserved, otherwise it's very cumbersome to use it. Right now, the `passthrough` columns are appended to the end **and** their names are removed, so it's hard to deal with the resulting object. – pcko1 Dec 09 '22 at 12:03
To preserve column names and order see answers to [this question](https://stackoverflow.com/q/68874492/11764049) – Aelius Feb 14 '23 at 10:18

ayhan · Answer 2 · 2019-06-19T15:42:58.267

46

Update:

Currently the best way to handle this is to use ColumnTransformer as explained here.

First create a copy of your dataframe:

scaled_features = data.copy()

Don't include the Name column in the transformation:

col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

Now, don't create a new dataframe but assign the result to those two columns:

scaled_features[col_names] = features
print(scaled_features)


        Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657

edited Jun 19 '19 at 15:42

answered Jul 17 '16 at 12:03

ayhan


It works, but I am unable to use the `inverse_transform` function to obtain the initial values with this method. With `test = scaled_features.iloc[1,:]` and `test_inverse = scaler.inverse_transform(test)` I got the error: `ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,)` – mitsi Jul 17 '16 at 13:01

`scaler.inverse_transform(scaled_features[col_names].values)` works for me. – ayhan Jul 17 '16 at 13:06
I was trying to test the `inverse_transform` function with the first row. Yes, it works for me too, but I'm losing the `Name` column. I could insert it if I (re)convert the whole dataframe. But what if I want to `inverse_transform` only the first row? – mitsi Jul 17 '16 at 13:22
Excuse me if I haven't been clear, but when I mention the `Name` column I mean the column containing the names (the 2nd column of the dataframe, the one that I don't want to scale), not the names of the columns – mitsi Jul 17 '16 at 13:41
Yes (not necessarily the first row, but a new line with the same structure) – mitsi Jul 17 '16 at 13:49
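For the single-row case discussed in these comments, a minimal sketch might look like the following. The idea is to pass only the scaled columns to `inverse_transform`, since the scaler was fitted on two features, and to keep the row two-dimensional by selecting it with a list. The names `row` and `restored` are illustrative, not from the original answer.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

# Same setup as in the answer: scale Age and Weight, leave Name alone
scaled_features = data.copy()
scaler = StandardScaler().fit(data[col_names].values)
scaled_features[col_names] = scaler.transform(data[col_names].values)

# Select one row as a one-row DataFrame (note the list [1]), keeping all columns
row = scaled_features.loc[[1]]

# Invert only the columns the scaler was fitted on; Name is carried along untouched
restored = row.copy()
restored[col_names] = scaler.inverse_transform(row[col_names].values)
print(restored)
```

Passing the full three-column row is what triggers the broadcasting error quoted above, because the scaler only knows about two features.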

score 8 · Answer 3 · answered Aug 27 '21 at 18:53

Late to the party, but here's my preferred solution:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

# list of cols to scale
cols_to_scale = ['Age','Weight']

# create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

# scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])
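One reason to keep fit and transform as separate steps, as above, is that the fitted scaler can later be applied to new rows with the same columns. The sketch below illustrates this under the assumption of a hypothetical `new_data` frame (not part of the question), and checks the result against the manual formula (x - mean) / std using the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
cols_to_scale = ['Age', 'Weight']

# Fit once on the original data
scaler = StandardScaler().fit(data[cols_to_scale])

# Hypothetical new rows with the same structure
new_data = pd.DataFrame({'Name': [7, 8], 'Age': [30, 60], 'Weight': [70, 55]})

# Reuse the fitted mean and standard deviation; Name is left untouched
new_scaled = new_data.copy()
new_scaled[cols_to_scale] = scaler.transform(new_data[cols_to_scale])

# Manual check with the statistics of the original data
manual = (new_data[cols_to_scale] - data[cols_to_scale].mean()) / data[cols_to_scale].std(ddof=0)
print(np.allclose(new_scaled[cols_to_scale], manual))
```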

score 3 · Answer 4 · answered Jun 10 '21 at 17:08

The easiest way I find is:

from sklearn.preprocessing import StandardScaler
# `temp` is the DataFrame to scale; only its float64 columns are selected here
numerical = temp.select_dtypes(include='float64').columns
# This will transform the selected columns and write them back to the original data frame
temp.loc[:,numerical] = StandardScaler().fit_transform(temp.loc[:,numerical])

Output

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657

Danil · Answer 5 · 2018-06-26T14:09:13.417

Another option would be to drop the Name column before scaling and then merge it back:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

# Save the variable you don't want to scale
name_var = data['Name']

# Create the scaler and fit it to your data (without Name)
scaler = StandardScaler()
scaler.fit(data.drop('Name', axis = 1))

# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(data.drop('Name', axis = 1))

# Rebuild the DataFrame from the scaled values, then add Name back
data = pd.DataFrame(scaled_values, index = data.index, columns = data.drop('Name', axis = 1).columns)
data['Name'] = name_var

print(data)

score 0 · Answer 6 · answered Jul 17 '16 at 14:07


A more pythonic way to do this -

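from sklearn.preprocessing import StandardScaler
# apply passes each column as a 1-D Series, but StandardScaler expects 2-D input,
# so convert the column to a one-column frame and flatten the result
data[['Age','Weight']] = data[['Age','Weight']].apply(
                           lambda x: StandardScaler().fit_transform(x.to_frame()).ravel())
data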

Output -

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657

answered Jul 17 '16 at 14:07

hashcode55


"How can I apply the fit and transform functions only on the columns Age and Weight". I was not aware that the OP wanted to do those things. – hashcode55 Jul 17 '16 at 14:37
