Assume that I have a training dataset, which I split into train and test sets. For training, I use StandardScaler: fit_transform on the train data and transform on the test data. Then I train a model and save it.

train.py:

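import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("train.csv")
X = data[["X"]]  # double brackets keep X 2-D, as StandardScaler expects
y = data["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scale = StandardScaler()
X_train_s = scale.fit_transform(X_train)  # fit on the training split only
X_test_s = scale.transform(X_test)        # reuse the training statistics

# model is any scikit-learn estimator, created earlier
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# save model
joblib.dump(model, filename)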

Now I load the model in another script, where I have another dataset that is only for prediction. The question is how to scale the prediction dataset when I don't have the training dataset. Is it correct to call fit_transform on the prediction dataset, as below?

prediction.py:

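import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("prediction.csv")
X = data[["X"]]  # double brackets keep X 2-D, as StandardScaler expects
y = data["y"]

scale = StandardScaler()
X_predict_s = scale.fit_transform(X)  # refits the scaler on the prediction data

loaded_model = joblib.load(filename)
y_pred = loaded_model.predict(X_predict_s)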

Or do I have to load the training data into prediction.py and use it to fit the scaler?

desertnaut
Mohammad

1 Answer

I like using pickle, but the same logic applies to joblib.

In essence, you have to dump your scaler and load it in the new script, just like you did with model and loaded_model.

In the script where you trained the model:

from pickle import dump

# save model
with open('model.pkl', 'wb') as f:
    dump(model, f)
# save scaler
with open('scale.pkl', 'wb') as f:
    dump(scale, f)

In the script where you load the model:

from pickle import load

# load model (pickle.load takes only the open file object)
with open('model.pkl', 'rb') as f:
    loaded_model = load(f)
# load scaler
with open('scale.pkl', 'rb') as f:
    loaded_scale = load(f)

Now transform your data with loaded_scale.transform (not fit_transform, which would refit the scaler on the new data) and predict on the scaled data with loaded_model.predict.
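
For reference, here is a minimal end-to-end sketch of the same idea using joblib, since that is what the question uses. The synthetic data, the LinearRegression model, and the file names model.joblib and scale.joblib are assumptions made up for illustration. Both "scripts" are shown in one block so it can be run as is.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# --- training side (what train.py would do) ---
# Synthetic stand-in for train.csv: one feature, one target.
X_train = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(size=100)

scale = StandardScaler()
X_train_s = scale.fit_transform(X_train)  # learn mean/std from training data
model = LinearRegression().fit(X_train_s, y_train)

# Persist BOTH the model and the fitted scaler (hypothetical file names).
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")
scale_path = os.path.join(out_dir, "scale.joblib")
joblib.dump(model, model_path)
joblib.dump(scale, scale_path)

# --- prediction side (what prediction.py would do) ---
loaded_model = joblib.load(model_path)
loaded_scale = joblib.load(scale_path)

# Synthetic stand-in for prediction.csv, drawn from a shifted distribution.
X_new = rng.normal(loc=12.0, scale=2.0, size=(5, 1))
X_new_s = loaded_scale.transform(X_new)  # transform only, never fit here
y_pred = loaded_model.predict(X_new_s)
```

The loaded scaler carries the mean and standard deviation learned from the training data, so the training set itself is not needed at prediction time.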

Arturo Sbr
  • We are both guilty of not looking for (expected) duplicates - deleted my own answer (despite upvoted)... – desertnaut Mar 25 '21 at 14:43
  • I thought your answer was good. Is there a policy for deleting answers on duplicates? Based on my experience, it's nice to have answered duplicates because it's easier to find a solution to the same issue (because there's more links available upon a single google search). – Arturo Sbr Mar 25 '21 at 14:59
  • No, there is not such policy; but there is a reasonable unwritten guideline that one can *either* answer a question *or* vote to close it, but not both. So, since I closed the question myself as a duplicate, my own answer had to go (not yours). Other than that, there is a general guideline that, especially for such questions (which we should expect they have already been answered), we should look for duplicates before answering them, and if we find, close them as such instead of answering them. All in all, there is no action expected from you here, and your answer is fine. – desertnaut Mar 25 '21 at 17:22
  • "*it's nice to have answered duplicates because [...] more links available*" - yes, that's exactly the rationale for not deleting duplicate questions and closing them instead as such, but the "answered" part of your argument does not hold. – desertnaut Mar 25 '21 at 17:25