
I'm using sklearn's MinMaxScaler to normalize the features of a model.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

training_set = np.random.rand(4, 4) * 10
training_set

    [[ 6.01144787,  0.59753007,  2.0014852 ,  3.45433657],
     [ 6.03041646,  5.15589559,  6.64992437,  2.63440202],
     [ 2.27733136,  9.29927394,  0.03718093,  7.7679183 ],
     [ 9.86934288,  7.59003904,  6.02363739,  2.78294206]]


scaler = MinMaxScaler()
scaler.fit(training_set)
scaler.transform(training_set)

    [[ 0.49184811,  0.        ,  0.29704831,  0.15972182],
     [ 0.4943466 ,  0.52384506,  1.        ,  0.        ],
     [ 0.        ,  1.        ,  0.        ,  1.        ],
     [ 1.        ,  0.80357559,  0.9052909 ,  0.02893534]]

Now I want to use the same scaler to normalize the test set:

    [[ 8.31263467,  7.99782295,  0.02031658,  9.43249727],
     [ 1.03761228,  9.53173021,  5.99539478,  4.81456067],
     [ 0.19715961,  5.97702519,  0.53347403,  5.58747666],
     [ 9.67505429,  2.76225253,  7.39944931,  8.46746594]]
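
Applying the already-fitted scaler to new data would just be a transform call, with no second fit; a minimal sketch, assuming the array above is stored as test_set:

# reuse the scaler fitted on the training data; do NOT call fit() again
test_scaled_set = scaler.transform(test_set)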

But I don't want to refit the scaler on the training data every time. Is there a way to save the scaler to a file and load it back later?

– Luis Ramon Ramirez Rodriguez

5 Answers


Update: sklearn.externals.joblib is deprecated. Install and use the standalone joblib package instead. Please see Engineero's answer below, which is otherwise identical to mine.

Original answer

Even better than pickle (which creates much larger files than this method), you can use sklearn's built-in tool:

from sklearn.externals import joblib  # deprecated; use plain `import joblib` now

scaler_filename = "scaler.save"
joblib.dump(scaler, scaler_filename)

# And now to load...
scaler = joblib.load(scaler_filename)
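
If you want to check the file-size difference yourself, here is a quick sketch (the filenames are just examples):

import os
import pickle

# dump the same fitted scaler both ways and compare the resulting files
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
joblib.dump(scaler, "scaler.save")

print(os.path.getsize("scaler.pkl"), os.path.getsize("scaler.save"))
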
– Ivan Vegner
  • It's a good solution, but isn't that the same as pickle? I'm a beginner in machine learning. – gold-kou Jul 01 '18 at 05:07
  • It is not -- `joblib.dump` is optimized for dumping sklearn objects and therefore creates much smaller files than pickle, which dumps the object with all its dependencies and such. – Ivan Vegner Jul 02 '18 at 16:30
  • My experience with `pickle` is poor: it probably works for a short-term export, but over a long period of time you have to deal with the protocol version (one of the pickling parameters), and I've encountered errors when loading old exports. Thus, I prefer this answer. – Vojta F Feb 26 '19 at 15:23

So I'm actually not an expert with this, but from a bit of research and a few helpful links, I think pickle and sklearn.externals.joblib are going to be your friends here.

The pickle package lets you save ("dump") models to a file.

I think this link is also helpful. It talks about model persistence. Something that you're going to want to try is:

# could use: import pickle... however let's do something else
from sklearn.externals import joblib

# joblib is more efficient than pickle for things like large numpy arrays,
# which sklearn models often have

# then just 'dump' your model to a file
joblib.dump(clf, 'my_dope_model.pkl')
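
And to read the model back in later, the matching call is:

clf = joblib.load('my_dope_model.pkl')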

Here is where you can learn more about the sklearn externals.

Let me know if that doesn't help or I'm not understanding something about your model.

Note: sklearn.externals.joblib is deprecated. Install and use the standalone joblib package instead.
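
A minimal modern equivalent of the snippet above, just swapping the import:

import joblib

joblib.dump(clf, 'my_dope_model.pkl')
clf = joblib.load('my_dope_model.pkl')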

– jlarks32
  • For some reason, when I use this to save a `MinMaxScaler`, the loaded scaler doesn't scale the data identically to a freshly fitted scaler. Any idea why? – BallpointBen Jun 08 '17 at 19:53
  • @BallpointBen Just tried it on a separate test set and got the same results. Maybe you used `np.random.rand` again? – Breina Jul 02 '17 at 09:46

Just a note that sklearn.externals.joblib has been deprecated and is superseded by plain old joblib, which can be installed with pip install joblib:

import joblib
joblib.dump(my_scaler, 'scaler.gz')
my_scaler = joblib.load('scaler.gz')

Note that the file extension can be anything, but if it is one of ['.z', '.gz', '.bz2', '.xz', '.lzma'] then the corresponding compression protocol will be used. See the docs for the joblib.dump() and joblib.load() methods.
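
If you prefer to be explicit rather than relying on the extension, joblib.dump() also takes a compress argument; a small sketch:

import joblib

# compress=3 uses zlib level 3: a reasonable size/speed trade-off
joblib.dump(my_scaler, 'scaler.joblib', compress=3)
my_scaler = joblib.load('scaler.joblib')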

– Engineero

You can use pickle to save the scaler:

import pickle

scalerfile = 'scaler.sav'
with open(scalerfile, 'wb') as f:
    pickle.dump(scaler, f)

Load it back:

import pickle

scalerfile = 'scaler.sav'
with open(scalerfile, 'rb') as f:
    scaler = pickle.load(f)

test_scaled_set = scaler.transform(test_set)
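
A quick way to confirm that the unpickled object really carries the fitted state is to inspect the attributes MinMaxScaler learns during fit():

# these exist only on a fitted scaler; they drive the transform
print(scaler.data_min_)   # per-feature minima from the training set
print(scaler.data_max_)   # per-feature maxima from the training set
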
– Psidom

The best way to do this is to create an ML pipeline like the following:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.externals import joblib  # deprecated; use plain `import joblib` now

# chain the scaler and your estimator so they are fitted (and saved) together
pipeline = make_pipeline(MinMaxScaler(), YOUR_ML_MODEL())

model = pipeline.fit(X_train, y_train)

Now you can save it to a file:

joblib.dump(model, 'filename.mod') 

Later you can load it like this:

model = joblib.load('filename.mod')
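
A nice property of this approach is that the saved object is the whole pipeline, so prediction on new data applies the scaler automatically (assuming some X_test):

# scales X_test with the fitted MinMaxScaler, then predicts, in one call
predictions = model.predict(X_test)
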
– PSN
  • You can use joblib or pickle here. The point is to create a pipeline so that you don't have to separately call the scaler. – PSN Aug 23 '19 at 09:24
  • This is _instead_ of saving the model, correct? If so, this seems like a better answer than the above, as you don't have to manage two separate files. – codehearted Aug 13 '20 at 03:37