Assume I have a training dataset. I split it into train / test. For training, I use a StandardScaler: fit_transform on the train data, then transform on the test data. Then I train a model and save it.
train.py:
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("train.csv")
X = data[["X"]]  # double brackets keep X two-dimensional, as the scaler expects
y = data["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scale = StandardScaler()
X_train_s = scale.fit_transform(X_train)  # learn mean/std from the train data only
X_test_s = scale.transform(X_test)        # reuse the train statistics

model = LinearRegression()  # placeholder estimator; any scikit-learn model works here
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# save model
joblib.dump(model, "model.joblib")  # "model.joblib" is a placeholder file name
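One thing worth noting about the save step: joblib.dump serializes any Python object, not just models, so the fitted scaler could be written out the same way (the file name here is a placeholder of my choosing):

# the fitted scaler holds the train mean/std, so it can be persisted too
joblib.dump(scale, "scaler.joblib")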
Now I load the model in another script, where I have a separate dataset used only for prediction. The question is: how do I scale the prediction dataset when I no longer have the train dataset? Is it correct to call fit_transform on the prediction dataset, as below?
prediction.py:
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("prediction.csv")
X = data[["X"]]
y = data["y"]

scale = StandardScaler()
X_predict_s = scale.fit_transform(X)  # fits a brand-new scaler on the prediction data

loaded_model = joblib.load("model.joblib")
y_pred = loaded_model.predict(X_predict_s)
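To make the worry concrete: a scaler fitted on prediction.csv learns that file's own mean and standard deviation, which will generally differ from the statistics the model was trained against. A quick way to compare the two, assuming both CSV files are at hand:

train_scale = StandardScaler().fit(pd.read_csv("train.csv")[["X"]])
pred_scale = StandardScaler().fit(pd.read_csv("prediction.csv")[["X"]])
print(train_scale.mean_, train_scale.scale_)  # statistics learned from the train data
print(pred_scale.mean_, pred_scale.scale_)    # statistics fit_transform above would use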
Or do I have to load the train data into prediction.py and use it to fit the scaler there (fit on the train data, then transform the prediction data), as in the sketch below?
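For reference, this is roughly what that second option would look like, assuming train.csv is still accessible from prediction.py (file names reused from above):

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# recover the scaling statistics by re-fitting on the original train data
train = pd.read_csv("train.csv")
scale = StandardScaler()
scale.fit(train[["X"]])

# apply the train statistics to the prediction data
predict = pd.read_csv("prediction.csv")
X_predict_s = scale.transform(predict[["X"]])

loaded_model = joblib.load("model.joblib")
y_pred = loaded_model.predict(X_predict_s)

Note that this re-fits on all of train.csv, while train.py fitted the scaler only on the post-split X_train, so the recovered statistics would differ slightly from the ones the model actually saw.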