20

I want to apply the scaling that the sklearn.preprocessing.scale module of scikit-learn offers to center a dataset which I will then use to train an SVM classifier.

How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?

I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

Ioannis Nasios
LetsPlayYahtzee

4 Answers

12

I think that the best way is to pickle it post fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
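For instance, a minimal sketch of the pickle route (scaler.pkl, X_train and X_new are illustrative names, not anything from the question):

import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)      # fit once on the training data

with open('scaler.pkl', 'wb') as f:         # persist the fitted scaler
    pickle.dump(scaler, f)

# later, in the classification script
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

X_new_scaled = scaler.transform(X_new)      # reuse the stored standardization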

Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:

scale_ : ndarray, shape (n_features,)
    Per feature relative scaling of the data. New in version 0.17: scale_ is recommended instead of deprecated std_.

mean_ : array of floats with shape [n_features]
    The mean value for each feature in the training set.

The following short snippet illustrates this:

from sklearn import preprocessing
import numpy as np

s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)

s.mean_, s.scale_
# -> (array([ 2.5]), array([ 1.11803399]))
Ami Tavory
  • There should be a way to construct the standard scaler with the parameters that were saved from the previous fitting. – CMCDragonkai Aug 28 '19 at 11:59
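As a rough illustration of the comment above: a new StandardScaler can be rebuilt by assigning the saved fitted attributes back. This leans on scikit-learn internals rather than a documented API, so treat it as a sketch (the numbers are just examples):

import numpy as np
from sklearn.preprocessing import StandardScaler

saved_mean = np.array([2.5])          # values kept from an earlier fit
saved_scale = np.array([1.11803399])

s = StandardScaler()
s.mean_ = saved_mean                  # assign the fitted attributes manually
s.scale_ = saved_scale
s.var_ = saved_scale ** 2             # keep var_ consistent with scale_

s.transform(np.array([[1.], [4.]]))
# -> approximately array([[-1.342], [ 1.342]])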
7

Scale with StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

save mean_ and var_ for later use

means = scaler.mean_ 
vars = scaler.var_    

(you can print and copy-paste means and vars, or save them to disk with np.save)
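For example, a minimal sketch of the np.save route (the file names are illustrative):

import numpy as np

np.save('scaler_means.npy', means)
np.save('scaler_vars.npy', vars)

# later, before classifying new data
means = np.load('scaler_means.npy')
vars = np.load('scaler_vars.npy')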

Later use of saved parameters

def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
Ioannis Nasios
5

You can use the joblib module to store the parameters of your scaler.

from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')

Later you can load the scaler.

from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)
Galo Castillo
  • You can use this, but is it better in some way? Is this part of the standard library, and does the new pickle alleviate some of the concerns that joblib tried to address in the first place? I don't know, but that information would be helpful. – wellplayed Feb 10 '22 at 21:41
2

Pickle introduces a security vulnerability that allows attackers to execute arbitrary code on your servers. The conditions are:

  • the possibility to replace the pickle file with another pickle file on the server (if no auditing of the pickle is performed, e.g. signature validation or hash comparison)

  • the same, but on a developer PC (the attacker has compromised some dev machine)

If your server-side applications are executed as root (or under root in docker containers), then this is definitely worth your attention.

Possible solution:

  • Model training should be done in a secure environment

  • Trained models should be signed with a key from another secure environment that is not loaded into the gpg-agent (otherwise the attacker can quite easily replace the signature)

  • CI should test the models in an isolated environment (quarantine)

  • Use Python 3.8 or later, which added audit hooks that can be used to block code-injection techniques (see the sketch after this list)

  • or just avoid pickle:)
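A sketch of the audit-hook idea, assuming the PEP 578 hooks from Python 3.8 are what is meant (the hook function and error message are made up for the example):

import sys

def block_unpickling(event, args):
    # CPython raises the 'pickle.find_class' audit event while unpickling;
    # raising here aborts the load before any class is resolved.
    if event == 'pickle.find_class':
        raise RuntimeError(f'blocked unpickling of {args[0]}.{args[1]}')

sys.addaudithook(block_unpickling)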


Possible approach to avoid pickling:

# scaler is a fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)

# some unscaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])

scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])
zhukovgreen
  • is pickle a bad idea due to security issues? Or is there another reason? – ClimateUnboxed Nov 30 '20 at 08:23
  • @AdrianTompkins yes, because it is possible to replace the pickle object with any other one and execute it on the host machine. Plus pickling creates unnecessary restrictions in terms of compatibility of pickling protocols. – zhukovgreen Nov 30 '20 at 08:26
  • @AdrianTompkins thanks for bringing this up. I actually had a mistake in the answer. Now it is fixed: added allow_pickle=False :) and provided a reference in the numpy source. – zhukovgreen Nov 30 '20 at 08:35
  • @wellplayed thanks for your note. I tried to be more transparent in describing my position. I updated the answer. If you disagree with it, please share your thoughts and arguments. – zhukovgreen Feb 11 '22 at 09:09