
When I use sklearn's MinMaxScaler(), I noticed some interesting behavior, which is shown in the following code.

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler(feature_range=(0, 1))
>>> scaler.fit(data)
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> test_data = [[-22, 20], [20.5, 26], [30, 40], [19, 13]]
>>> scaler.transform(test_data)
array([[-10.5   ,   1.125 ],
       [ 10.75  ,   1.5   ],
       [ 15.5   ,   2.375 ],
       [ 10.    ,   0.6875]])

I noticed that when I transform the test_data with the fitted MinMaxScaler(), it returns values beyond the defined range (0 to 1).

Here, I intentionally made the test_data fall outside the value range of "data", to test the output of MinMaxScaler().

I thought that when "test_data" has a value beyond the value range of the variable "data", it would return some error. That is not the case, however, and I got output values beyond the defined range.

My question is: why does the function exhibit this behavior (i.e. return output values beyond the defined range when a test_data value is beyond the value range of the data on which MinMaxScaler was fitted), instead of raising an error?

Glorian

2 Answers


MinMaxScaler throwing an error (and thus terminating program execution) whenever the resulting (transformed) data fall outside the feature_range provided during fitting would arguably be a bad and rather strange design choice.

Consider a real-world pipeline that periodically processes hundreds of thousands of incoming data samples, with such a scaler as one of its components. Imagine that the scaler did indeed throw an error and stop whenever any transformed feature fell outside the range [0, 1]. Now consider a batch of, say, 500K data samples in which just a couple of features end up outside the [0, 1] range after transformation. The whole pipeline would simply break down...

Who might be happy in such a scenario? (tentative answer: nobody).

Could the responsible data scientist or ML engineer possibly claim "but why, this is the correct thing to do, since there are obviously bad data"? No, not by a long shot...


The notion of concept drift, i.e. the unforeseeable changes in the underlying distribution of streaming data over time, is a huge ML sub-topic of great practical interest and an area of intense research. The idea behind such functions not throwing errors in these cases is that, if the modeler has reason to believe something like this might happen in practice (it almost always does), rendering their ML results largely useless, then it is their own responsibility to deal with it explicitly in their deployed systems. Leaving such a serious job on the shoulders of a (humble...) scaling function would be largely inappropriate and, at the end of the day, a mistake.

Generalizing the discussion a bit: MinMaxScaler is just a helper function; the underlying assumption when using it (as with scikit-learn and similar libraries as a whole, in fact) is that we know what we are doing, and that we are not just mindless dummies randomly turning knobs and pressing buttons until our models seem to "work". Should Keras warn us when we try something truly meaningless, like requesting classification accuracy in a regression problem? Well, it does not - a minimum of knowledge is assumed on the part of the user, and we should not expect the frameworks themselves to protect us from such mistakes in our own modeling.

Similarly here: it is our job to be aware of the possibility of getting out-of-range values for transformed new data and to handle the situation accordingly; it is not the job of MinMaxScaler (or of any other similar transformer) to halt the process on our behalf.


Returning to your own toy example, or to my hypothetical one: it is always possible to add extra logic after the transformation of new data, so that such cases are handled appropriately; even just checking which (and how many) samples are problematic is arguably far easier after such a transformation than before (thus providing a first, crude alert of possible concept drift). By not throwing an error (and thus halting the whole process), scikit-learn gives you, the modeler, all the options to proceed as you see fit, provided again that you know your stuff. Simply throwing an error and refusing to continue would not be productive here, and the design choice of the scikit-learn developers seems well justified.
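For illustration only, here is a minimal sketch of what such post-transformation logic could look like (this reuses the question's data and is not code from the original answer): flag the samples that fall outside the fitted feature_range, and then decide what to do with them (clip, discard, raise an alert, etc.).

>>> import numpy as np
>>> from sklearn.preprocessing import MinMaxScaler
>>> train_data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> test_data = [[-22, 20], [20.5, 26], [30, 40], [19, 13]]
>>> scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_data)
>>> transformed = scaler.transform(test_data)
>>> # flag samples with at least one feature outside the requested feature_range
>>> low, high = scaler.feature_range
>>> out_of_range = np.any((transformed < low) | (transformed > high), axis=1)
>>> print(f"{out_of_range.sum()} of {len(transformed)} samples fall outside [{low}, {high}]")
4 of 4 samples fall outside [0, 1]
>>> # one possible way to handle them: clip back into the range
>>> clipped = np.clip(transformed, low, high)

(Recent scikit-learn versions also offer a clip argument, i.e. MinMaxScaler(clip=True), if clipping is indeed the desired behavior; the point remains that this is an explicit choice the modeler makes, not something the scaler should silently enforce or error out on.)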

desertnaut

If you treat the MinMaxScaler as just another ML model, then you simply got a bad test score. It's the same as, say, an R-squared of 0.01 on the test set - the test data differ from the train data so much that the model failed to produce good results for them.

Now, why does scikit-learn not raise an error when your model's accuracy is 0.07? Or when the value of the loss function is off the charts? Probably because it cannot know what fitness score is bad enough (or even which fitness score to use).

Also, the fit function computes the per-feature minimum and maximum to be used for later scaling, so "training" the "model" on your train data essentially computed and stored those two values. When you then transform any data X (train or test), the following formula is applied (see the scikit-learn documentation for MinMaxScaler):

X_std = (X - X_train_min) / (X_train_max - X_train_min)
X_scaled = X_std * (max - min) + min

Here, X_train_min and X_train_max come from the training data, min and max are the bounds of feature_range, and only X involves the data you're applying the "model" to. So of course the model gave "incorrect" predictions - the min and max of the test set are different from the ones used for "training" the model, so nothing constrains the output to stay within [0, 1].
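As a rough illustration (using the question's data; not part of the original answer), the transform can be reproduced by hand to confirm that only the training minimum and maximum enter the computation:

>>> import numpy as np
>>> from sklearn.preprocessing import MinMaxScaler
>>> train_data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
>>> test_data = np.array([[-22, 20], [20.5, 26], [30, 40], [19, 13]])
>>> scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_data)
>>> # the only statistics stored by fit(): per-feature min/max of the *training* data
>>> train_min = train_data.min(axis=0)   # [-1.,  2.], same as scaler.data_min_
>>> train_max = train_data.max(axis=0)   # [ 1., 18.], same as scaler.data_max_
>>> # apply the documented formula to the *test* data
>>> f_min, f_max = scaler.feature_range  # (0, 1)
>>> X_std = (test_data - train_min) / (train_max - train_min)
>>> X_scaled = X_std * (f_max - f_min) + f_min
>>> np.allclose(X_scaled, scaler.transform(test_data))  # same out-of-range values
True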

ForceBru
  • OP's question is not why sklearn does not return an error when the accuracy is too low, but why it does not do so "*when the test_data value is beyond the value range in the data in which MinMaxScaler is being fitted*". – desertnaut Dec 06 '20 at 13:39
  • @desertnaut, this is explained in the last half of my answer - it's because the `min` and `max` values are coming from the _train_ dataset – ForceBru Dec 06 '20 at 13:41
  • I mean that your 2nd paragraph looks irrelevant (this was never the question), and somewhat weird IMHO. Plus, the term "incorrect predictions" you use in the last part is rather unfortunate - they are *not* incorrect (given what the function is supposed to do). – desertnaut Dec 06 '20 at 13:42
  • (cont.) if we consider that the function gives "incorrect" predictions, then shouldn't it raise an error, as OP implicitly suggests? (answer: no, but it is exactly the explanation why not that is arguably required by the OP here). – desertnaut Dec 06 '20 at 13:54
  • @desertnaut, yes, the predictions can only be judged as "incorrect" or "far from optimal" based on some fit score, right? But what score do you use and at what value of this score do you raise an error? What score is "bad enough"? That's the point of the second paragraph - sklearn can't select a score that makes sense for all models and a threshold value of that score that triggers the error – ForceBru Dec 06 '20 at 14:00
  • I don't follow; I cannot see any mention of a score in the question, only in your answer. And yes, in the question the predictions are implied to be "incorrect" simply because they return values outside an expected range, w/o any reference to a specific score. – desertnaut Dec 06 '20 at 14:12
  • @desertnaut, in a nutshell, I'm proposing to treat `MinMaxScaler` as any other ML model, and I think sklearn does the same. Raising an error when the model "returns an output value beyond the defined range" makes sense for `MinMaxScaler`, but it may not make sense in general, so it looks like sklearn just lets it spit out data computed by the formula, as is the case with any other ML model. Garbage in - garbage out. But if OP wanted to get the error, they'd need a way of estimating how wrong the model is and thus use a score. – ForceBru Dec 06 '20 at 14:21
  • They don't need a score for that - just checking if (any) returned values are outside the `feature_range` provided when fitting the scaler would do the job; that's why your 2nd para is irrelevant here - it addresses a different question not actually asked. OP's question is *why* sklearn does not do so and does not throw an error when this happens. It's a question on the *design* of the function and its rationale (or so it seems to me...) – desertnaut Dec 06 '20 at 15:03