I tried RobustScaler in sklearn and found that its results do not match the formula.

The formula for RobustScaler in sklearn is:

Figure 1. The formula used by RobustScaler
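
Written out, and assuming the figure shows the standard median/IQR form implied by the calculation further down, the formula is as follows (the symbols Q_1, Q_2, Q_3 for the first quartile, median, and third quartile of a feature are notation chosen here, not necessarily the figure's):

```latex
x_{\text{scaled}} = \frac{x - Q_2(x)}{Q_3(x) - Q_1(x)}
```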

I have a matrix shown as below:

Figure 2. The test matrix

I tested the first value of feature one (row one, column one). The scaled value should be (1-3)/(5.5-1.5) = -0.5. However, the result from sklearn is -0.67. Does anyone know where my calculation goes wrong?

The code using sklearn is below:

import numpy as np
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
ZH. Yang

1 Answer

From the RobustScaler documentation (emphasis added):

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

That is, the median and IQR are calculated per column, not over the whole array.

Having clarified that, let's calculate the scaled values for your first column manually:

import numpy as np

x1 = np.array([1, 4, 7, 2]) # your 1st column here

q75, q25 = np.percentile(x1, [75 ,25])
iqr = q75 - q25

x1_med = np.median(x1)

x1_scaled = (x1-x1_med)/iqr
x1_scaled
# array([-0.66666667,  0.33333333,  1.33333333, -0.33333333])

which is the same as the first column of your own x_new, as calculated by scikit-learn:

# your code verbatim:
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
# result
[[-0.66666667 -0.375      -0.35294118 -0.33333333]
 [ 0.33333333  0.375       0.35294118  0.33333333]
 [ 1.33333333  1.125       1.05882353  1.        ]
 [-0.33333333 -0.625      -0.82352941 -1.        ]]

np.all(x1_scaled == x_new[:,0])
# True

Similarly for the rest of the columns (features): you need to calculate the median and IQR separately for each one before scaling it.
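
As a minimal sketch of that per-column calculation, using only NumPy and the same matrix as in the question, `axis=0` computes each statistic once per column. The variable names (`col_median`, `col_iqr`, `x_manual`) are illustrative only:

```python
import numpy as np

x = np.array([[1, 2, 3, 4],
              [4, 5, 6, 7],
              [7, 8, 9, 10],
              [2, 1, 1, 1]], dtype=float)

# axis=0 reduces down the rows, giving one statistic per column (feature)
col_median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)  # default linear interpolation
col_iqr = q75 - q25

# Broadcasting subtracts each column's median and divides by its IQR
x_manual = (x - col_median) / col_iqr
print(x_manual)
```

The printed array should agree, up to floating-point rounding, with the x_new shown above.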

UPDATE (after comment):

As pointed out in the Wikipedia entry on quartiles:

For discrete distributions, there is no universal agreement on selecting the quartile values

See also the relevant reference, Sample quantiles in statistical packages:

There are a large number of different definitions used for sample quantiles in statistical computer packages

Digging into the documentation of np.percentile used here, you'll see that there are no fewer than five (5) different methods of interpolation, and not all of them produce identical results (see also the 4 different methods demonstrated in the Wikipedia entry linked just above); here is a quick demonstration of these methods and their results on the x1 data defined above (note that from NumPy 1.22 on the keyword is called method, and interpolation is deprecated):

np.percentile(x1, [75 ,25]) # interpolation='linear' by default
# array([4.75, 1.75])

np.percentile(x1, [75 ,25], interpolation='lower')
# array([4, 1])

np.percentile(x1, [75 ,25], interpolation='higher')
# array([7, 2])

np.percentile(x1, [75 ,25], interpolation='midpoint')
# array([5.5, 1.5])

np.percentile(x1, [75 ,25], interpolation='nearest')
# array([4, 2])

Apart from the fact that no two of these methods produce identical results, it should also be apparent that the definition you are using in your own calculations corresponds to interpolation='midpoint', while the default NumPy method is interpolation='linear'. And as Ben Reiniger correctly points out in the comments below, what the source code of RobustScaler actually uses is np.nanpercentile (a variant of the np.percentile I have used here that can handle nan values) with the default interpolation='linear' setting.
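
To make the difference concrete, here is a small sketch that reproduces both numbers for the first column. `quartiles_by_halves` is a hypothetical helper written for this illustration (it is not part of NumPy or scikit-learn); it implements the "median of the lower and upper halves" definition from the question, which for this four-element column happens to give the same quartiles as interpolation='midpoint':

```python
import numpy as np

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data.

    Hypothetical helper for illustration only. Needs at least two values;
    for odd-length input the middle value is left out of both halves.
    """
    s = np.sort(np.asarray(values, dtype=float))
    half = len(s) // 2
    return np.median(s[:half]), np.median(s[-half:])

x1 = np.array([1, 4, 7, 2])
x1_med = np.median(x1)

# Definition used in the question: quartiles 1.5 and 5.5, IQR 4
q25_h, q75_h = quartiles_by_halves(x1)
x1_halves = (x1 - x1_med) / (q75_h - q25_h)

# NumPy default (linear interpolation): quartiles 1.75 and 4.75, IQR 3
q25_l, q75_l = np.percentile(x1, [25, 75])
x1_linear = (x1 - x1_med) / (q75_l - q25_l)

print(x1_halves[0], x1_linear[0])
```

The first printed value is the -0.5 from the manual calculation, the second is the -0.67 that scikit-learn reports; the only difference between the two is the quartile definition.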

desertnaut
  • Hi @desertnaut, thanks for the answer. I still have a small question about q75 and q25. The first column is [1,4,7,2]. The q25 should be the median of the first two elements of the sorted array, which is (1+2)/2 = 1.5, and the q75 should be the median of the last two elements, which is (4+7)/2 = 5.5. Then iqr is 5.5-1.5=4. The median is (2+4)/2=3. Then x_scale is (1-3)/(5.5-1.5)=-0.5. The difference between this and sklearn's result comes from the parameters q25 and q75. My calculation of iqr follows https://en.wikipedia.org/wiki/Interquartile_range. I would appreciate any further suggestions. – ZH. Yang Feb 06 '21 at 16:57
  • Just to note: RobustScaler calls `numpy.nanpercentile`, [here](https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/preprocessing/_data.py#L1376), leaving the interpolation to the default. – Ben Reiniger Feb 06 '21 at 18:17
  • @desertnaut Thanks for the update. If RobustScaler is replaced by Normalizer, the x_scale I calculate is still not the same as sklearn's result. The formula I applied is x_i/||x||_2. For example, the first element of the first feature (column) is 1. The L2 norm of the first column is sqrt(1+16+49+4)=8.3666. The x_scale for this point is 1/8.3666=0.1195. However, the result using Normalizer in sklearn is 0.18257419. What's the reason for that? Thanks in advance. – ZH. Yang Feb 06 '21 at 20:36
  • @BenReiniger the code is given below: `from sklearn.preprocessing import Normalizer; x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]; scaler = Normalizer(); x_new = scaler.fit_transform(x); print(x_new)` – ZH. Yang Feb 06 '21 at 20:39
  • @ZH.Yang `Normalizer` operates on rows, not columns. – Ben Reiniger Feb 06 '21 at 20:56
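
The row-versus-column point in the last comment can be checked by hand. The following is a NumPy-only sketch (it does not call scikit-learn, and the names `by_row` and `by_col` are illustrative) that divides the same matrix by its per-row and per-column L2 norms:

```python
import numpy as np

x = np.array([[1, 2, 3, 4],
              [4, 5, 6, 7],
              [7, 8, 9, 10],
              [2, 1, 1, 1]], dtype=float)

# keepdims=True keeps the norms 2-D so they broadcast against x
row_norms = np.linalg.norm(x, axis=1, keepdims=True)  # one L2 norm per row (sample)
col_norms = np.linalg.norm(x, axis=0, keepdims=True)  # one L2 norm per column (feature)

by_row = x / row_norms
by_col = x / col_norms

print(by_row[0, 0])  # 1 / sqrt(1 + 4 + 9 + 16)
print(by_col[0, 0])  # 1 / sqrt(1 + 16 + 49 + 4)
```

The row-wise value is the 0.18257419 reported by Normalizer, and the column-wise value is the 0.1195 from the manual calculation. If column-wise scaling is what is wanted, the function `sklearn.preprocessing.normalize` takes an `axis` argument, and `axis=0` should give the per-column behaviour.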