StandardScaler difference between "with_std=False or True" and "with_mean=False or True"

Question

I am trying to standardize some data so that I can apply PCA to it. I am using sklearn.preprocessing.StandardScaler. I am having trouble understanding the difference between using True or False for the parameters with_mean and with_std (documentation).

Can someone give a more extended explanation?

If you set `with_mean`/`with_std` to `False`, it means it will use `0`/`1` as mean/std dev instead of measuring these on the data first. If both are thus set to `False`, you use a [*standard normal distribution*](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution). — Willem Van Onsem, Aug 04 '19 at 20:26

score 5 · Answer 1 · edited Jun 23 '20 at 22:32

I have provided more details in this thread, but let me just explain this here as well.

The standardization of the data (each column/feature/variable individually) involves the following equation: z = (x - μ) / σ, where μ is the mean of the column and σ is its standard deviation.

Explanation:

If you set with_mean and with_std to False, then the mean μ is set to 0 and the std to 1, as if the columns/features already came from a standard normal (Gaussian) distribution (which has 0 mean and 1 std). Nothing is subtracted and nothing is divided, so the data is returned unchanged.

If you set with_mean and with_std to True, then μ and σ are computed from your data and used for the scaling. This is the most common approach, and it is also the default.
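To make the difference concrete, here is a minimal sketch that runs all four combinations on a small array. The values in `X` are made up for illustration, and the `results` dictionary is just a helper name used here, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made-up values): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Try every combination of the two flags and keep the transformed data.
results = {}
for with_mean in (True, False):
    for with_std in (True, False):
        scaler = StandardScaler(with_mean=with_mean, with_std=with_std)
        results[(with_mean, with_std)] = scaler.fit_transform(X)
        print(f"with_mean={with_mean}, with_std={with_std}")
        print(results[(with_mean, with_std)])
```

With both flags set to True each column ends up with mean 0 and std 1. With only with_mean=True the columns are centered but keep their original spread. With only with_std=True each column is divided by its std but not centered. With both set to False the output equals the input.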

edited Jun 23 '20 at 22:32

desertnaut

answered Aug 06 '19 at 18:05

seralouk

Hi, could you please include an example with the answer showing the difference between with_mean and with_std set to False/True? It will help me in clearing my understanding further. I am sorry for the trouble. – learner Oct 24 '20 at 03:34

score 3 · Answer 2 · answered Aug 04 '19 at 20:41

A standard scaler is usually used to fit a normal distribution to the data and then calculate the Z-scores. This means that the mean μ and the standard deviation σ of the data are calculated first, and then the Z-scores are calculated as z = (x - μ) / σ.
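As a rough check of that formula, the sketch below computes the Z-scores by hand with NumPy and compares them with the output of a default StandardScaler. The sample values and the names `z_manual` and `z_sklearn` are chosen here for illustration only. Note that the comparison uses the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature with mean 5 and population std 2.
X = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population std (ddof=0)
z_manual = (X - mu) / sigma

scaler = StandardScaler()            # with_mean=True, with_std=True by default
z_sklearn = scaler.fit_transform(X)

print(scaler.mean_, scaler.scale_)       # fitted mean and std per column
print(np.allclose(z_manual, z_sklearn))  # compare manual and sklearn Z-scores
```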

By setting with_mean or with_std to False, we set the mean μ to 0 or the standard deviation σ to 1, respectively. If both are set to False, we thus calculate the Z-score with respect to a standard normal distribution [wiki], which leaves the values as they are.

The main use case of setting with_mean to False is processing sparse matrices. Sparse matrices contain a significant number of zeros and are therefore stored in a way that the zeros use no (or very little) memory. If we fit the mean and then calculate the Z-score, it is almost certain that all zeros will be mapped to non-zero values and thus use (significant amounts of) memory. For large sparse matrices, that can result in a memory error: the data is so large that the matrix no longer fits in memory. By setting μ=0, values that are zero map to zero. The result of the standard scaler is then a sparse matrix with the same shape.
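A small sketch of the sparse case, assuming SciPy is installed alongside scikit-learn. The matrix values and the name `centering_error` are made up for this example. In the scikit-learn versions I am aware of, asking for centering on sparse input raises a ValueError instead of silently densifying the data.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Small made-up sparse matrix: most entries are zero.
X_sparse = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                       [0.0, 0.0, 3.0],
                                       [4.0, 0.0, 0.0],
                                       [0.0, 6.0, 0.0]]))

# Scaling without centering keeps zeros at zero, so the result stays sparse.
scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(scaled), scaled.shape, scaled.nnz)

# Centering would turn the zeros into non-zero values, so it is refused.
try:
    StandardScaler(with_mean=True).fit_transform(X_sparse)
    centering_error = None
except ValueError as exc:
    centering_error = exc
print(type(centering_error).__name__)
```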
