16

I have been looking for an answer to the following question. When I use,

import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

After this I train and test the model (A and B as features, C as the label) and get some accuracy score. My question is: what happens when I have to predict the label for a new set of data? Say,

df = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

If I normalize these columns the same way, the values of A and B will be scaled according to the new data, not the data the model was trained on. So after the data preparation step below,

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])

the values of A and B in the new data are scaled with respect to the min and max of the new df[['A','B']], whereas the training data was scaled with respect to the min and max of the training df[['A','B']].

How can the data preparation be valid when the two sets are scaled against different numbers? I don't understand how the prediction can be correct here.

seralouk
Tia
  • You will kind of have to use the same scaler if you want to use the trained model..save the scaler and reapply it. – Uvar May 28 '18 at 12:06

2 Answers

57

You should fit the MinMaxScaler using the training data and then apply the scaler on the testing data before the prediction.
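To see why refitting on the new data is a problem, here is a minimal sketch using column A from the question. The variable names (`train_A`, `new_A`, `reused`, `refit`) are only illustrative, and the default `feature_range=(0, 1)` is assumed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column A from the question: training values and new values to predict on
train_A = np.array([1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9], dtype=float).reshape(-1, 1)
new_A = np.array([25, 67, 24, 76, 23], dtype=float).reshape(-1, 1)

# Fit once on the training data: the scaler stores the training min and max
scaler = MinMaxScaler().fit(train_A)

# Reusing the fitted scaler keeps the new data on the training scale
reused = scaler.transform(new_A).ravel()

# Refitting on the new data puts it on a different scale (its own min and max)
refit = MinMaxScaler().fit_transform(new_A).ravel()

print(reused)
print(refit)
```

With the reused scaler, a raw value of 25 lands above 1 because it is larger than anything seen in training. With the refitted scaler the same 25 lands close to 0, so the model would receive inputs on a scale it was never trained on.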


In summary:

  • Step 1: fit the scaler on the TRAINING data
  • Step 2: use the scaler to transform the TRAINING data
  • Step 3: use the transformed training data to fit the predictive model
  • Step 4: use the scaler to transform the TEST data
  • Step 5: predict using the trained model (step 3) and the transformed TEST data (step 4).

Example using your data:

import pandas as pd
from sklearn import preprocessing
from sklearn.svm import SVC

min_max_scaler = preprocessing.MinMaxScaler()
#training data
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
#fit and transform the training data and use them for the model training
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

#fit the model (SVC is only a placeholder here; any scikit-learn classifier works)
model = SVC()
model.fit(df[['A','B']], df['C'])

#after the model training on the transformed training data define the testing data df_test
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

#before the prediction of the test data, ONLY APPLY the scaler on them
df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])

#test the model
y_predicted_from_model = model.predict(df_test[['A','B']])

Example using iris data:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = datasets.load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = SVC()
model.fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
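As an alternative to calling the scaler by hand, the same fit-on-train, transform-on-test order can be delegated to a scikit-learn `Pipeline`. This is a different way of writing the steps above, not something the question requires; the names `pipe`, `y_pred_pipe` and `y_pred_manual` are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() fits the scaler on X_train only, then trains the SVC on the scaled data
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X_train, y_train)

# predict() applies the already-fitted scaler to X_test before predicting
y_pred_pipe = pipe.predict(X_test)

# The manual version of the same steps, for comparison
scaler = MinMaxScaler()
model = SVC().fit(scaler.fit_transform(X_train), y_train)
y_pred_manual = model.predict(scaler.transform(X_test))

print(np.array_equal(y_pred_pipe, y_pred_manual))
```

The pipeline makes it harder to call `fit_transform` on the test data by accident, because the scaler is never touched directly.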

Hope this helps.

See also my post here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79

seralouk
  • That helped a lot, thank you. I would like to know on what basis transform() works on the new dataframe. In `df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])` the MinMaxScaler does the data preparation using the formula **Xnorm = (X - Xmin) / (Xmax - Xmin)**. In `df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])`, however, it does not seem to use that formula, so how exactly is the data scaled here? – Tia May 29 '18 at 06:40
  • The output for `df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})` followed by `df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])` is `A = [1.60,4.40,1.53,5.00,1.46] B = [-0.125,3.125,1.125,4.437,0.937]`. How is the data scaling happening here? It does not appear to use the _**Xnorm = (X - Xmin) / (Xmax - Xmin)**_ formula. – Tia May 29 '18 at 06:53
  • It is using `X_min` and `X_max` from the training set (the one that was used to fit `min_max_scaler`) – FlorianGD Jun 27 '18 at 08:40
  • @Tia here, the `Xmin` and `Xmax` are calculated from the training set, then the training data are normalized and finally the same values are used to normalize the testing data. – seralouk Jun 27 '18 at 09:05
  • If I understand correctly, the last line of the iris example should be `y_pred = model.predict(X_test_scaled)`, shouldn't it? – Guillaume Ansanay-Alex Aug 30 '18 at 14:59
  • @seralouk What if the y_train data has a big scale too? How should it be corrected after the prediction? What is the best approach in that case? – cdvv7788 Sep 13 '18 at 03:30
  • Does MinMaxScaler assume the training data will contain the maximum value for X? What if my testing set, or a future data set for predicting, contains a value much larger than what the MinMaxScaler was fit with? Will it be able to handle that? – csteel Mar 06 '19 at 17:36
  • that should not be a problem. It may happen but nothing will change. – seralouk Mar 06 '19 at 20:13
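The formula question raised in the comments can be checked directly. The sketch below assumes the default `feature_range=(0, 1)` and uses the fitted scaler's `data_min_` and `data_max_` attributes; `train`, `test`, `manual` and `scaled` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
                      'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19]})
test = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# fit() stores the per-column min and max of the TRAINING data
scaler = MinMaxScaler().fit(train)
print(scaler.data_min_, scaler.data_max_)

# transform() applies (X - Xmin) / (Xmax - Xmin) with those stored training values
manual = (test.to_numpy() - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
scaled = scaler.transform(test)

print(np.allclose(manual, scaled))
# Test values outside the training range are still transformed; they simply fall outside [0, 1]
print(scaled.min(), scaled.max())
```

So the same formula is used in both cases; only the source of Xmin and Xmax differs. For column A the training min is 1 and the training max is 16, which gives (25 - 1) / (16 - 1) = 1.6 for the first test value, matching the output quoted in the comments above.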
2

The best way is to fit the MinMaxScaler once, save it, and load the same scaler whenever it is required.

Saving model:

import pickle

# min_max_scaler is the MinMaxScaler instance created in the question
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
with open("scaler.pkl", 'wb') as f:
    pickle.dump(min_max_scaler, f)

Loading saved model:

with open("scaler.pkl", 'rb') as f:
    scalerObj = pickle.load(f)
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
df_test[['A','B']] = scalerObj.transform(df_test[['A','B']])
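A self-contained round trip can confirm that the loaded scaler behaves exactly like the one that was saved. This is a minimal sketch: the temporary directory and the names `path`, `loaded` and `same` are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
                      'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19]})
test = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Fit on the training data only
scaler = MinMaxScaler().fit(train)

# Save the fitted scaler (a temporary directory is used here to keep the sketch tidy)
path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")
with open(path, "wb") as f:
    pickle.dump(scaler, f)

# Later, possibly in another process: load it and transform new data
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded scaler carries the training min/max, so both give the same result
same = np.allclose(scaler.transform(test), loaded.transform(test))
print(same)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that wrote them, and pickle files should only be loaded from trusted sources.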
vipin bansal