
I'm stuck on the problem of scaling new data. In my scheme, I have trained and tested the model, with x_train and x_test both scaled using sklearn's MinMaxScaler(). Now, when applying the model to a real-time process, how can I scale the new input to the same scale as the training and testing data? The steps are as below:

from sklearn.preprocessing import MinMaxScaler

featuresData = df[features].values # Array of all features, thousands of rows
sc = MinMaxScaler(feature_range=(-1,1), copy=False)
featuresData = sc.fit_transform(featuresData)

#Running model to make the final model
model.fit(X,Y)
model.predict(X_test)

#Saving to abcxyz.h5

Then, when implementing with new data:

#load the model abcxyz.h5
#catching new data 
#Scaling new data to put into the loaded model << I'm stuck at this step
#...

So how do I scale the new data for prediction, and then inverse transform the final result? From my logic, it needs to be scaled in the same manner as the old scaler used before training the model.

ShanN

3 Answers


Given the way you used scikit-learn, you need to have saved the transformer:

import joblib
# ...
sc = MinMaxScaler(feature_range=(-1,1), copy=False)
featuresData = sc.fit_transform(featuresData)

joblib.dump(sc, 'sc.joblib') 

# with new data
sc = joblib.load('sc.joblib')
transformData = sc.transform(newData)
# ...
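Since you also asked about inverse-transforming the final result, the same idea applies: persist whichever scaler produced the scaled values and call its inverse_transform later. A minimal sketch, assuming the target Y was scaled with its own MinMaxScaler (the names sc_y, y_train, and the prediction value here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target values -- stand-ins for your Y
y_train = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)

# A separate scaler for the target, fitted once at training time
sc_y = MinMaxScaler(feature_range=(-1, 1))
y_scaled = sc_y.fit_transform(y_train)

# ... train on y_scaled; at prediction time the model outputs
# values in the scaled space, e.g.:
y_pred_scaled = np.array([[0.0]])

# Map the prediction back to the original units
y_pred = sc_y.inverse_transform(y_pred_scaled)
print(y_pred)   # [[25.]] -- the midpoint of the original 10..40 range
```

The target scaler would be dumped and loaded with joblib exactly like the feature scaler above.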

The best way to use scikit-learn is to merge your transformations with your model. That way, you only save your model, which includes the transformation pipeline.

from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline


clf = svm.SVC(kernel='linear')
sc = MinMaxScaler(feature_range=(-1,1), copy=False)

model = Pipeline([('scaler', sc), ('svc', clf)])

#...

When you call model.fit, the pipeline first runs fit_transform on your scaler under the hood. When you call model.predict, the scaler's transform is applied before prediction.
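To illustrate, here is a runnable end-to-end sketch with toy data (the arrays and the model.joblib filename are made up for the example):

```python
import joblib
import numpy as np
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Toy data standing in for your featuresData / labels
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = Pipeline([
    ('scaler', MinMaxScaler(feature_range=(-1, 1))),
    ('svc', svm.SVC(kernel='linear')),
])

model.fit(X, y)                  # fit_transform on the scaler, then fit on the SVC
joblib.dump(model, 'model.joblib')

# Later, with new data: one load, one call -- scaling happens inside the pipeline
model = joblib.load('model.joblib')
print(model.predict([[2.5], [11.5]]))   # -> [0 1]
```

New inputs are scaled with the min/max learned at training time, so there is no separate scaler object to keep track of.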

Prayson W. Daniel

Consider the following example:

data1 = np.array([0, 1, 2, 3, 4, 5])
data2 = np.array([0, 2, 4, 6, 8, 10])

sc = MinMaxScaler()
sc.fit_transform(data1.reshape(-1, 1))

Output:

array([[0. ],
       [0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ]])

The second data set will give you the same values after scaling:

sc.fit_transform(data2.reshape(-1, 1))

Output:

array([[0. ],
       [0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ]])

Let's fit on the first data set and use the same scaler for the second one:

sc.fit(data1.reshape(-1, 1))
sc.transform(data2.reshape(-1, 1)) 

Output:

array([[0. ],
       [0.4],
       [0.8],
       [1.2],
       [1.6],
       [2. ]])
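The fitted parameters that make this possible are stored on the scaler itself, so you can inspect what transform will do to new values (same toy data1 as above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data1 = np.array([0, 1, 2, 3, 4, 5])

sc = MinMaxScaler()
sc.fit(data1.reshape(-1, 1))

print(sc.data_min_, sc.data_max_)   # [0.] [5.] -- learned from data1
print(sc.scale_)                    # [0.2] -- each unit maps to 0.2

# transform is just x * scale_ + min_, so 6 maps past 1.0:
print(sc.transform([[6]]))          # [[1.2]]
```

This is why values outside the range seen during fit can land outside the target feature_range.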
Mykola Zotko

You should use fit() and transform() to do that, as follows:

# Let's say you read real-time data as new_data

featuresData = df[features].values
sc = MinMaxScaler(feature_range=(-1,1), copy=False)
featuresData = sc.fit_transform(featuresData)
new_data = sc.transform(new_data)

sc.transform will apply the same scaling to new_data that was applied to featuresData.

talatccan
  • Is there any method to save the scale parameters from the first scaling, so I don't need to fit_transform again each time I have new data? – ShanN Jan 03 '20 at 09:09
  • Let's say you have 100 as a max value in the first input. You get 1 after scaling. What value will you get after scaling if your max value in the second input is equal to 1000? – Mykola Zotko Jan 03 '20 at 09:19
  • 1
    It will not scale it within the given range. And scaled value for 1000 will be more than top of given scale range for MinMaxScaler. @MykolaZotko – talatccan Jan 03 '20 at 09:42
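The point made in the comments can be checked directly: a scaler fitted on data whose max is 100 maps 1000 to 10, well outside the target range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()                        # default feature_range=(0, 1)
sc.fit(np.array([0, 100]).reshape(-1, 1))

print(sc.transform([[100]]))               # [[1.]]
print(sc.transform([[1000]]))              # [[10.]] -- far outside (0, 1)
```

MinMaxScaler does not clip by default, so out-of-range inputs produce out-of-range outputs.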