
I have been using an online ML course with code-along style videos. In one of the videos, where we had to implement feature scaling, the instructor said it was necessary to create two separate StandardScaler objects - one for the feature (there was only one feature) and one for the label - to scale the data. This doesn't make sense to me: as far as my intuition goes, calling the fit_transform method a second time completely overwrites the previous fit. I asked about this on the course's Q&A forum, but the people answering questions there said that if we didn't create two separate objects, the data would somehow be scaled under a combined effect of the feature and the label, which, again, is something I can't wrap my head around.

I tested this and at least as far as this dataset is concerned, my intuition was correct:

from sklearn.preprocessing import StandardScaler

# X holds the single feature, y holds the label (both as 2-D column arrays)
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)  # fits the scaler on X, then transforms X
y_sc = scaler.fit_transform(y)  # re-fits the same scaler on y, then transforms y

print(y_sc)

The output I got was:

[[-0.72004253]
 [-0.70243757]
 [-0.66722767]
 [-0.59680786]
 [-0.49117815]
 [-0.35033854]
 [-0.17428902]
 [ 0.17781001]
 [ 0.88200808]
 [ 2.64250325]]

And then using 2 separate scaler objects:

scaler_x = StandardScaler()
scaler_y = StandardScaler()
X_sc2 = scaler_x.fit_transform(X)
y_sc2 = scaler_y.fit_transform(y)

print(y_sc2)

The output:

[[-0.72004253]
 [-0.70243757]
 [-0.66722767]
 [-0.59680786]
 [-0.49117815]
 [-0.35033854]
 [-0.17428902]
 [ 0.17781001]
 [ 0.88200808]
 [ 2.64250325]]

The same answer. To 8 decimal places.
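
One way to confirm the re-fit directly (a quick sketch, reusing X, y and the single scaler from above) is to inspect the fitted statistics after each call:

scaler = StandardScaler()
scaler.fit_transform(X)
print(scaler.mean_)  # mean of X

scaler.fit_transform(y)
print(scaler.mean_)  # now the mean of y - the fit on X has been overwritten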

I still don't know if I'm correct, though, because the teaching assistants on that forum seem strangely adamant about their explanation. They haven't replied since I showed them screenshots of my output. I know this is something very trivial, and it might come across as immature that I'm pursuing such a small thing to such an extent, but it really gets on my nerves when people spread wrong or only partially correct information. I am completely open to the idea that I could be wrong too, but to arrive at that conclusion I would also want a proper explanation of why my understanding is wrong.

  • Your intuition is indeed correct - `fit_transform` indeed re-fits the scaler (and how else could it be?). But what your instructors have told you is also correct - we do use 2 separate scalers for the feature & labels, the reason being that we need to *save* and keep them both in order to use them later with the *test* (or even more later, with the *real*) data, which would be impossible if we had just re-fit them. But in any case, this is not a *programming* question, hence it is off-topic here - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Jul 18 '23 at 06:23
  • See here for a toy example and another reason why we need the scalers kept & saved for use in *prediction* time: https://stackoverflow.com/questions/48973140/how-to-interpret-mse-in-keras-regressor/49009442#49009442 – desertnaut Jul 18 '23 at 06:26
  • 1
    @desertnaut thank you for your reply. Also, I apologize for this being off-topic, I just signed up so I didn't know. Thank you for the links! – yes-im-here Jul 18 '23 at 06:31
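
A minimal sketch of the point in desertnaut's comments above - keeping both fitted scalers so they can be reused at prediction time. The SVR model and the train/test arrays here are made-up placeholders:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# made-up training data: one feature, one label
X_train = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_train = X_train ** 3

scaler_x = StandardScaler()
scaler_y = StandardScaler()
X_train_sc = scaler_x.fit_transform(X_train)
y_train_sc = scaler_y.fit_transform(y_train).ravel()

model = SVR().fit(X_train_sc, y_train_sc)

# at prediction time both fitted scalers are still needed
X_test = np.array([[6.5]])
y_pred_sc = model.predict(scaler_x.transform(X_test))          # scale the feature
y_pred = scaler_y.inverse_transform(y_pred_sc.reshape(-1, 1))  # un-scale the prediction
print(y_pred)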

1 Answer


The fit_transform call fits the scaler on the data first and then transforms it, so you can reuse one scaler for as many different calls as you like, but that usually makes little sense and is inefficient. Each call only fits on the data currently passed in, so two samples scaled in different calls end up scaled with different parameters (mean, variance).

It depends on the use case, but for machine learning you normally do not want that. You want a single mean and variance computed over your whole (training) dataset and reuse them, so you fit the scaler on all of that data first and then apply it to individual samples.

This is also more efficient: the transform call is fast, whereas the fit call might not be.

import numpy as np
from profilehooks import profile  # third-party package: pip install profilehooks
from sklearn.preprocessing import StandardScaler


@profile()
def one_scaler(x, y):
    # re-fits a single scaler on each array before transforming it
    scaler = StandardScaler()
    x_sc = scaler.fit_transform(x)
    y_sc = scaler.fit_transform(y)
    return x_sc, y_sc


@profile()
def fixed_scalers(x, y, scaler_x: StandardScaler, scaler_y: StandardScaler):
    # only transforms, using scalers that were fitted once beforehand
    return scaler_x.transform(x), scaler_y.transform(y)


# Create scalers on "all" data
all_x = np.random.randint(0, 100, size=(100000, 1))
all_y = np.random.randint(100, 1000, size=(100000, 1))
scaler_x = StandardScaler().fit(all_x)
scaler_y = StandardScaler().fit(all_y)

for i in range(1000):
    # Use scalers on a subset of the data
    x = np.random.randint(0, 100, size=(10, 1))
    y = np.random.randint(100, 1000, size=(10, 1))
    x_sc_1, y_sc_1 = one_scaler(x, y)
    x_sc_2, y_sc_2 = fixed_scalers(x, y, scaler_x, scaler_y)

Here I get the following runtimes. You can see the one_scaler version needs around 3 times as long, with the fit call taking 0.621 s in total. There is obviously the one-off overhead for the fixed version of computing the mean and std of all the data first, but you can do that once, save the values, and use them to initialize the scalers later.

fixed_scalers (...)
function called 1000 times

         390000 function calls (386000 primitive calls) in 0.267 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 76 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.001    0.000    0.266    0.000 tmp.py:14(fixed_scalers)
     2000    0.003    0.000    0.265    0.000 _set_output.py:138(wrapped)
     2000    0.015    0.000    0.256    0.000 _data.py:986(transform)
one_scaler (...)
function called 1000 times

         1195000 function calls (1185000 primitive calls) in 0.880 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 145 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.002    0.000    0.880    0.001 tmp.py:6(one_scaler)
4000/2000    0.005    0.000    0.878    0.000 _set_output.py:138(wrapped)
     2000    0.003    0.000    0.871    0.000 base.py:887(fit_transform)
     2000    0.002    0.000    0.621    0.000 _data.py:812(fit)
     2000    0.009    0.000    0.617    0.000 base.py:1134(wrapper)
     2000    0.020    0.000    0.471    0.000 _data.py:839(partial_fit)
     4000    0.018    0.000    0.410    0.000 base.py:508(_validate_data)
     4000    0.050    0.000    0.371    0.000 validation.py:646(check_array)
     2000    0.015    0.000    0.237    0.000 _data.py:986(transform)
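
Following up on the idea above of fitting once and saving the values: a sketch of persisting the fitted scalers with joblib (a common choice for scikit-learn objects; the file names here are arbitrary) so they can be reloaded later without re-fitting:

import joblib

# persist the scalers that were fitted on all the data
joblib.dump(scaler_x, "scaler_x.joblib")
joblib.dump(scaler_y, "scaler_y.joblib")

# later (e.g. at prediction time) reload them and transform without re-fitting
scaler_x_loaded = joblib.load("scaler_x.joblib")
scaler_y_loaded = joblib.load("scaler_y.joblib")
new_x = np.random.randint(0, 100, size=(5, 1))
print(scaler_x_loaded.transform(new_x))  # uses the stored mean_ / scale_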

Nopileos
  • Thank you! I was just confused about whether or not I can (theoretically) make do with 1 standard scaler object instead of creating one for the features and the other one for the label. Thanks again for your answer, it made me look at this from the efficiency pov – yes-im-here Jul 19 '23 at 05:48
  • Please upvote and accept answers if you think your question was answered. – Nopileos Jul 19 '23 at 06:19