20

I want to apply the scaling that the sklearn.preprocessing.scale module of scikit-learn offers to center a dataset which I will then use to train an SVM classifier.

How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?

I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

Ioannis Nasios
LetsPlayYahtzee

4 Answers

12

I think that the best way is to pickle it post fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
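For instance, a minimal sketch of the pickle route (scaler.pkl, X_train and X_new are illustrative names, not anything from the question):

import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)      # fit once on the training data

with open('scaler.pkl', 'wb') as f:         # persist the fitted scaler
    pickle.dump(scaler, f)

# later, in the classification script
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

X_new_scaled = scaler.transform(X_new)      # reuse the stored standardization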

Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:

scale_ : ndarray, shape (n_features,)
    Per feature relative scaling of the data. New in version 0.17: scale_ is recommended instead of deprecated std_.

mean_ : array of floats with shape [n_features]
    The mean value for each feature in the training set.

The following short snippet illustrates this:

from sklearn import preprocessing
import numpy as np

s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)

s.mean_, s.scale_
# -> (array([ 2.5]), array([ 1.11803399]))
Ami Tavory
  • There should be a way to construct the standard scaler with the parameters that were saved from the previous fitting. – CMCDragonkai Aug 28 '19 at 11:59
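As a rough illustration of the comment above: a new StandardScaler can be rebuilt by assigning the saved fitted attributes back. This leans on scikit-learn internals rather than a documented API, so treat it as a sketch (the numbers are just examples):

import numpy as np
from sklearn.preprocessing import StandardScaler

saved_mean = np.array([2.5])          # values kept from an earlier fit
saved_scale = np.array([1.11803399])

s = StandardScaler()
s.mean_ = saved_mean                  # assign the fitted attributes manually
s.scale_ = saved_scale
s.var_ = saved_scale ** 2             # keep var_ consistent with scale_

s.transform(np.array([[1.], [4.]]))
# -> approximately array([[-1.342], [ 1.342]])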
7

Scale with StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

save mean_ and var_ for later use

means = scaler.mean_ 
vars = scaler.var_    

(you can print and copy-paste means and vars, or save them to disk with np.save)
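For example, a minimal sketch of the np.save route (the file names are illustrative):

import numpy as np

np.save('scaler_means.npy', means)
np.save('scaler_vars.npy', vars)

# later, before classifying new data
means = np.load('scaler_means.npy')
vars = np.load('scaler_vars.npy')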

Later use of saved parameters

def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
Ioannis Nasios
5

You can use the joblib module to store the parameters of your scaler.

from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')

Later you can load the scaler.

from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)
Galo Castillo
  • You can use this, but is it better in some way? Is this part of the standard library, and does the new pickle alleviate some of the concerns that joblib tried to address in the first place? I don't know, but that information would be helpful. – wellplayed Feb 10 '22 at 21:41
2

Pickle introduces a security vulnerability that allows attackers to execute arbitrary code on your servers. The conditions are:

  • the possibility to replace the pickle file with another pickle file on the server (if no auditing of the pickle is performed, e.g. signature validation or hash comparison)

  • the same, but on a developer PC (the attacker has compromised some dev machine)

If your server-side applications are executed as root (or under root in docker containers), then this is definitely worth your attention.

Possible solution:

  • Model training should be done in a secure environment

  • Trained models should be signed with a key from another secure environment that is not loaded into the gpg-agent (otherwise the attacker can quite easily replace the signature)

  • CI should test the models in an isolated environment (quarantine)

  • Use Python 3.8 or later, which added audit hooks that can be used to block code-injection techniques (see the sketch after this list)

  • or just avoid pickle:)
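A sketch of the audit-hook idea, assuming the PEP 578 hooks from Python 3.8 are what is meant (the hook function and error message are made up for the example):

import sys

def block_unpickling(event, args):
    # CPython raises the 'pickle.find_class' audit event while unpickling;
    # raising here aborts the load before any class is resolved.
    if event == 'pickle.find_class':
        raise RuntimeError(f'blocked unpickling of {args[0]}.{args[1]}')

sys.addaudithook(block_unpickling)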


Possible approach to avoid pickling:

# scaler is a fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)

# some unscaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])

scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])
zhukovgreen
  • is pickle a bad idea due to security issues? Or is there another reason? – ClimateUnboxed Nov 30 '20 at 08:23
  • @AdrianTompkins yes, because it is possible to replace the pickle object with any other one and execute it on the host machine. Plus pickling creates unnecessary restrictions in terms of compatibility of pickling protocols. – zhukovgreen Nov 30 '20 at 08:26
  • @AdrianTompkins thanks for bringing this up. I actually had a mistake in the answer. Now it is fixed: added allow_pickle=False :) and provided a reference in the numpy source. – zhukovgreen Nov 30 '20 at 08:35
  • @wellplayed thanks for your note. I tried to be more transparent in describing my position. I updated the answer. If you disagree with it, please share your thoughts and arguments. – zhukovgreen Feb 11 '22 at 09:09