I am trying to standardize some data so that I can apply PCA to it. I am using sklearn.preprocessing.StandardScaler. I am having trouble understanding the difference between using True or False for the parameters with_mean and with_std (documentation).

Can someone give a more extended explanation?

desertnaut
alvaro otero
    If you set `with_mean`/`with_std` to `False`, it means it will use `0`/`1` as mean/std dev instead of measuring these on the data first. If both are thus set to `False`, you use a [*standard normal distribution*](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution). – Willem Van Onsem Aug 04 '19 at 20:26

2 Answers

I have provided more details in this thread, but let me just explain this here as well.

The standardization of the data (each column/feature/variable individually) involves the following equations:

μ = (1/N) Σ x_i
σ = sqrt((1/N) Σ (x_i - μ)²)
z = (x - μ) / σ


Explanation:

If you set with_mean and with_std to False, then the mean μ is set to 0 and the std σ to 1, so z = x and the data is returned unchanged. This only makes sense if the columns/features are already standardized, i.e. they already have 0 mean and 1 std like the standard normal distribution.

If you set with_mean and with_std to True, then you will actually use the true μ and σ of your data. This is the most common approach.
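
To make the difference concrete, here is a minimal sketch that runs StandardScaler with all four combinations of with_mean and with_std. The toy array X and the helper name scale are made up for illustration; the last lines compute the mean and the population standard deviation with NumPy for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])


def scale(with_mean, with_std):
    """Hypothetical helper: fit and transform X with the given flags."""
    scaler = StandardScaler(with_mean=with_mean, with_std=with_std)
    return scaler.fit_transform(X)


both = scale(True, True)          # (x - mean) / std, the default
center_only = scale(True, False)  # x - mean
scale_only = scale(False, True)   # x / std
neither = scale(False, False)     # x, unchanged

# Per-column statistics computed by hand for comparison.
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population std (ddof=0)

for name, result in [("both", both), ("center_only", center_only),
                     ("scale_only", scale_only), ("neither", neither)]:
    print(name)
    print(result)
```

Only the first variant gives columns with zero mean and unit variance, which is usually what you want before PCA.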

desertnaut
seralouk
    Hi, could you please include an example with the answer showing the difference between with_mean and with_std set to False/True? It will help me in clearing my understanding further. I am sorry for the trouble. – learner Oct 24 '20 at 03:34

A standard scaler is usually used to calculate Z-scores. First the mean μ and standard deviation σ of each feature are calculated from the data, and then the Z-scores are calculated with z = (x - μ) / σ.
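
As a rough illustration of that formula, the following sketch computes the Z-scores of a single column by hand using only the Python standard library. The function name z_scores and the sample values are made up; the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return (x - mu) / sigma for each x, using the population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


column = [2.0, 4.0, 6.0]  # made-up sample values for one feature
scores = z_scores(column)
print(scores)
```

After the transform the column has mean 0 and standard deviation 1, whatever its original units were.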

By setting with_mean or with_std to False, we respectively set the mean μ to 0 and the standard deviation σ to 1. If both are set to False, we thus calculate the Z-score as if the data came from a standard normal distribution, which leaves the values unchanged.

The main use case of setting with_mean to False is processing sparse matrices. Sparse matrices contain a large number of zeros, and are therefore stored in a way that the zeros use no (or very little) memory. If we fitted the mean and then calculated the Z-scores, almost all zeros would be mapped to non-zero values, and would thus use (significant amounts of) memory. For large sparse matrices, that can result in a memory error: the data is so large that the matrix no longer fits in memory. By setting μ=0, values that are zero map to zero. The result of the standard scaler is then a sparse matrix with the same shape.
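
The sparse case can be sketched as follows, assuming SciPy is installed alongside scikit-learn. The small matrix is made up for illustration. With the default with_mean=True, StandardScaler refuses to center sparse input and raises a ValueError; with with_mean=False it only rescales the stored values, so the zeros stay zeros.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse matrix: most entries are zero.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [0.0, 0.0, 3.0],
                                [4.0, 0.0, 0.0],
                                [0.0, 6.0, 0.0]]))

# Centering would turn the zeros into non-zeros, so it is rejected.
try:
    StandardScaler(with_mean=True).fit_transform(X)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("with_mean=True failed:", err)

# Scaling only: each column is divided by its std, zeros are untouched.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)

print(sparse.issparse(X_scaled))  # still a sparse matrix
print(X.nnz, X_scaled.nnz)        # number of stored values before and after
```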

Willem Van Onsem