47

I want to use sklearn's StandardScaler. Is it possible to apply it to some feature columns but not others?

For instance, say my data is:

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

   Age  Name  Weight
0   18     3      68
1   92     4      59
2   98     6      49


col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

I fit and transform the data

scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)

       Name       Age    Weight
0 -1.069045 -1.411004  1.202703
1 -0.267261  0.623041  0.042954
2  1.336306  0.787964 -1.245657

But of course the names are not really integers but strings and I don't want to standardize them. How can I apply the fit and transform methods only on the columns Age and Weight?

Janosh
mitsi
  • I would like to suggest a better solution: the accepted answer does not preserve column names and is therefore poor. Instead, this one-liner should be used: data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']]) – Philipp Schwarz Apr 12 '22 at 17:00

6 Answers

61

Introduced in v0.20 is ColumnTransformer which applies transformers to a specified set of columns of an array or pandas DataFrame.

import pandas as pd
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

col_names = ['Name', 'Age', 'Weight']
features = data[col_names]

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
        ('somename', StandardScaler(), ['Age', 'Weight'])
    ], remainder='passthrough')

ct.fit_transform(features)

NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

-1.41100443,  1.20270298,  3.       
 0.62304092,  0.04295368,  4.       
 0.78796352, -1.24565666,  6.       
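
As a rough sketch of the make_column_transformer shorthand mentioned in the note, the same transformation on the question's data might look like this. It assumes a scikit-learn version in which each tuple is written as (transformer, columns); the variable name `result` is only illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# Each (transformer, columns) tuple replaces the (name, transformer, columns)
# triple that ColumnTransformer expects; the name is derived from the class.
ct = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough')

# NumPy array: scaled Age, scaled Weight, then the untouched Name column
result = ct.fit_transform(data)
```

The passthrough column still ends up last, exactly as in the output above, so the column order differs from the input frame.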
Guy C
  • 2
    This is now the best answer (doesn't require you to copy a data frame) – kellyfj Feb 05 '19 at 17:18
  • 6
    Nice answer! How could I preserve the column names if I did this with a pandas dataframe? Is there a way without having to rename all columns at the end? – DataBach Apr 23 '20 at 13:37
  • This is what I was looking for, best answer and faster, although using apply is also an alternative. – user3065757 Jul 06 '20 at 09:14
  • 1
    The accepted answer does not preserve column names and is therefore poor. Instead use this one-liner: `data[['Age', 'Weight']] = StandardScaler().fit_transform(data[['Age', 'Weight']])` – Philipp Schwarz Apr 12 '22 at 16:55
  • Either column names **or** column order needs to be preserved, otherwise it's very cumbersome to use it. Right now, the `passthrough` columns are appended to the end **and** their names are removed, so it's hard to deal with the resulting object. – pcko1 Dec 09 '22 at 12:03
  • To preserve column names and order see answers to [this question](https://stackoverflow.com/q/68874492/11764049) – Aelius Feb 14 '23 at 10:18
46

Update:

Currently the best way to handle this is to use ColumnTransformer as explained here.


First create a copy of your dataframe:

scaled_features = data.copy()

Don't include the Name column in the transformation:

col_names = ['Age', 'Weight']
features = scaled_features[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

Now, don't create a new dataframe but assign the result to those two columns:

scaled_features[col_names] = features
print(scaled_features)


        Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
ayhan
  • It works but I am unable to use the 'inverse_transform' function to obtain the initial values with this method. 'test = scaled_features.iloc[1,:]' 'test_inverse = scaler.inverse_transform(test)' I got the error : ValueError: operands could not be broadcast together with shapes (3,) (2,) (3,) – mitsi Jul 17 '16 at 13:01
  • 1
    `scaler.inverse_transform(scaled_features[col_names].values)` works for me. – ayhan Jul 17 '16 at 13:06
  • I was trying to test the `inverse_transform` function with the first row. Yes, it works for me too, but I'm losing the column `names`. I could insert it back if I (re)convert the whole dataframe. But what if I want to `inverse_transform` only the first line? – mitsi Jul 17 '16 at 13:22
  • Excuse me if I haven't been clear, but when I mention the column `name` I mean the column containing the names (the 2nd column of the dataframe, the one that I don't want to scale), not the names of the columns – mitsi Jul 17 '16 at 13:41
  • Yes (not necessarily the first row, but a new line with the same structure) – mitsi Jul 17 '16 at 13:49
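
A minimal sketch of how a single row could be inverse-transformed without losing the Name column, reusing the names from the answer above. The broadcast error in the first comment comes from passing all three columns to a scaler fitted on two; here only the two scaled columns are passed, and the variable `row` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
col_names = ['Age', 'Weight']

scaled_features = data.copy()
scaler = StandardScaler().fit(data[col_names].values)
scaled_features[col_names] = scaler.transform(data[col_names].values)

# Select one row as a one-row DataFrame (note the double brackets) so that
# the column labels, including Name, are kept.
row = scaled_features.iloc[[1]].copy()

# The scaler was fitted on two columns, so it only accepts an (n_rows, 2) array.
row[col_names] = scaler.inverse_transform(row[col_names].values)
```

The same pattern should work for a new row with the same structure, as long as only the columns in `col_names` are handed to the scaler.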
8

Late to the party, but here's my preferred solution:

#load data
data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

#list for cols to scale
cols_to_scale = ['Age','Weight']

#create and fit scaler
scaler = StandardScaler()
scaler.fit(data[cols_to_scale])

#scale selected data
data[cols_to_scale] = scaler.transform(data[cols_to_scale])
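
Keeping fit and transform as separate steps also means the fitted scaler can be reused on rows it has not seen. A hedged sketch follows; the `new` frame and its values are invented purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
# Hypothetical unseen row with the same columns (values made up)
new = pd.DataFrame({'Name': [7], 'Age': [50], 'Weight': [60]})

cols_to_scale = ['Age', 'Weight']

scaler = StandardScaler()
scaler.fit(train[cols_to_scale])   # learn mean and std from the training rows only

train[cols_to_scale] = scaler.transform(train[cols_to_scale])
# Reuse the training statistics instead of refitting on the new data
new[cols_to_scale] = scaler.transform(new[cols_to_scale])
```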
Alex
3

The easiest way I find is:

from sklearn.preprocessing import StandardScaler
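# Select only the float64 columns to scale (temp is your DataFrame);
# integer columns, such as those in the question's data, are not selected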
numerical = temp.select_dtypes(include='float64').columns
# This will transform the selected columns and merge to the original data frame
temp.loc[:,numerical] = StandardScaler().fit_transform(temp.loc[:,numerical])

Output

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
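
With the question's data every column is int64, so a float64 filter selects nothing. One way to adapt the idea, as a sketch only: select all numeric dtypes, drop the label column by name, and cast to float before assigning. The variable `to_scale` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

temp = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# 'number' matches every numeric dtype, so integer columns are included too.
numerical = temp.select_dtypes(include='number').columns
# Name is an integer-coded label here, so exclude it explicitly.
to_scale = numerical.drop('Name')

# Cast first so that float results are not written into integer columns.
temp = temp.astype({col: 'float64' for col in to_scale})
temp.loc[:, to_scale] = StandardScaler().fit_transform(temp.loc[:, to_scale])
```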
Addy
2

Another option would be to drop the Name column before scaling, then merge it back together:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name' : [3, 4,6], 'Age' : [18, 92,98], 'Weight' : [68, 59,49]})

# Save the variable you don't want to scale
name_var = data['Name']

# Create the scaler and fit it to the remaining columns
features = data.drop('Name', axis = 1)
scaler = StandardScaler()
scaler.fit(features)

# Calculate scaled values and store them in a separate object
scaled_values = scaler.transform(features)

# Rebuild the dataframe from the scaled values, then add Name back
data = pd.DataFrame(scaled_values, index = data.index, columns = features.columns)
data['Name'] = name_var

print(data)
Danil
0

A more pythonic way to do this -

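from sklearn.preprocessing import StandardScaler
# apply passes each column as a 1-D Series, but StandardScaler expects 2-D input,
# so convert the column to a one-column frame and flatten the result
data[['Age','Weight']] = data[['Age','Weight']].apply(
                           lambda x: StandardScaler().fit_transform(x.to_frame()).ravel())
data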

Output -

         Age  Name    Weight
0 -1.411004     3  1.202703
1  0.623041     4  0.042954
2  0.787964     6 -1.245657
hashcode55
  • "How can I apply the fit and transform functions only on the columns Age and Weight". I was not aware that the OP wanted to do those things. – hashcode55 Jul 17 '16 at 14:37