Correctly scaling features for machine learning

Question

My dataframe contains three kinds of features:

features with observations ranging from 0 to 100
features with observations coded as 0 or 1 (0 meaning no, 1 being yes)
features with observations ranging from 1 to 5 (a person's response to a question, with 1 being strongly disagree and 5 being strongly agree)

Can I just apply StandardScaler to my dataframe and have all the features scaled correctly? Or is a specific scaling method required for each of the different kinds of features in my dataframe?

Well, that's a point of view; another is that of teaching one how to fish instead of just giving them one single fish for today (and keeping them dependent)... BTW, I have essentially provided you with the answer, in case you were too busy with the rhetoric to notice. - desertnaut, Jun 23 '20 at 17:16

score 0 · Answer 1 · edited Jun 23 '20 at 20:37

StandardScaler scales your data so that each column will have μ = 0 and σ = 1. According to the documentation:

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

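As each feature is scaled independently of the others, a feature with a large range (such as 0 to 100) does not overshadow one with a small range (such as 0 or 1). It is worth noting that the result depends heavily on the distribution of the training samples for each feature. Standardization is only a linear shift and rescale, so it does not change the shape of a feature's distribution: a feature that is normally distributed in the training data ends up approximately standard normal after scaling, while a skewed or binary feature stays skewed or binary. For further understanding you may go through the documentation and also see this SO thread.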
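The per-column behaviour can be checked by hand. The following is a minimal sketch, assuming a small made-up array with one 0 to 100 column, one binary column and one 1 to 5 column; the names X, manual and auto are illustrative only. It compares a manual z-score, computed column by column, with what StandardScaler returns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption for this sketch):
# column 0 ranges over 0-100, column 1 is binary, column 2 is a 1-5 rating.
X = np.array([[100, 0, 2], [66, 1, 5], [50, 0, 4], [33, 1, 1]], dtype=float)

# Manual z-score: subtract each column's mean and divide by each column's
# standard deviation (population form, ddof=0, NumPy's default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler learns the same per-column statistics in fit().
scaler = StandardScaler().fit(X)
auto = scaler.transform(X)

print("Same result:", np.allclose(manual, auto))
print("Per-column means learned:", scaler.mean_)
print("Per-column std devs learned:", scaler.scale_)
```

Each column gets its own mean and standard deviation, so the three kinds of features never influence one another's scaling.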

Sample data is scaled in the following code:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Columns: a 0-100 feature, a binary (0/1) feature, a 1-5 rating feature
data = np.array([[100, 0, 2], [66, 1, 5], [50, 0, 4], [33, 1, 1],
                 [0, 0, 3], [25, 0, 2], [75, 1, 4], [50, 1, 3]])

# Learn each column's mean and standard deviation, then standardize
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print("Scaled Array:")
print(scaled_data)

# Per-column statistics of the scaled data: mean ~0, std and variance 1
print('Mean :', scaled_data.mean(axis=0))
print('Standard Deviation :', scaled_data.std(axis=0))
print('Variance :', scaled_data.var(axis=0))

The output is:

Scaled Array:
[[ 1.71992157 -1.         -0.81649658]
 [ 0.55329148  1.          1.63299316]
 [ 0.00428908 -1.          0.81649658]
 [-0.57902597  1.         -1.63299316]
 [-1.71134341 -1.          0.        ]
 [-0.85352716 -1.         -0.81649658]
 [ 0.86210533  1.          0.81649658]
 [ 0.00428908  1.          0.        ]]
Mean : [3.46944695e-18 0.00000000e+00 0.00000000e+00]
Standard Deviation : [1. 1. 1.]
Variance : [1. 1. 1.]
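Since the statistics are computed on the training set, unseen data is normally transformed with the already fitted scaler rather than refitted. A minimal sketch, assuming an arbitrary split of the sample rows into six training rows and two test rows (the split and the names X_train and X_test are assumptions for illustration, not part of the original example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split of the sample rows: six for training, two held out.
X_train = np.array([[100, 0, 2], [66, 1, 5], [50, 0, 4],
                    [33, 1, 1], [0, 0, 3], [25, 0, 2]], dtype=float)
X_test = np.array([[75, 1, 4], [50, 1, 3]], dtype=float)

scaler = StandardScaler()
# fit_transform learns the mean and std of each training column.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training statistics; nothing is re-learned here.
X_test_scaled = scaler.transform(X_test)

print("Train column means:", X_train_scaled.mean(axis=0))
print("Test column means: ", X_test_scaled.mean(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns generally do not, because they are expressed relative to the training statistics.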
