
I have a question very similar to this topic, but I want to reuse a StandardScaler instead of a LabelEncoder. Here's what I have done:

# in one program
params = {"mean": scaler.mean_, "var": scaler.var_}
# ... and save the dict


# in another program
# ... load the dict first
new_scaler = StandardScaler()
new_scaler.mean_ = params['mean']  # However, it doesn't work
new_scaler.var_ = params['var']    # Doesn't work either...

I also tried set_params, but it can only change the copy, with_mean, and with_std parameters.

So, how can I re-use the scaler I got in program one? Thanks!

user3768495

1 Answer


Just pickle the whole thing.

Follow the official docs.

You can either use Python's standard pickle from the first link or the specialized joblib pickle mentioned in the second link (which I recommend; it is often more efficient, although that hardly matters for a simple object like a scaler):

    import joblib
    import sklearn.preprocessing as skp

    new_scaler = skp.StandardScaler()
    # ... fit it on your data ...

    joblib.dump(new_scaler, 'my_scaler.pkl')      # save to disk

    loaded_scaler = joblib.load('my_scaler.pkl')  # load from disk
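For completeness, here is a minimal end-to-end sketch (the toy data and the `my_scaler.pkl` filename are just placeholders) showing that the reloaded scaler carries all its fitted state and transforms identically:

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data standing in for your real training set
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
scaler = StandardScaler().fit(X)

joblib.dump(scaler, 'my_scaler.pkl')          # save to disk
loaded_scaler = joblib.load('my_scaler.pkl')  # load from disk

# the restored scaler transforms exactly like the original
print(np.allclose(scaler.transform(X), loaded_scaler.transform(X)))  # True
```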

If you want to store your sklearn objects in a database such as MySQL, MongoDB, or Redis, the file-based example above of course won't work.

The easy approach then: use python-pickle's `pickle.dumps`, which dumps to a bytes object (ready for most DB wrappers).
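A minimal sketch of that bytes-based route (the toy data is a placeholder; the resulting `blob` is what you would hand to your DB driver as a BLOB column, document field, or Redis value):

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler = StandardScaler().fit(X)

# serialize to a bytes object -- no file involved
blob = pickle.dumps(scaler)

# ... later, read the bytes back from the database and restore
restored = pickle.loads(blob)
print(np.allclose(scaler.transform(X), restored.transform(X)))  # True
```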

For the more efficient joblib, wrap an in-memory buffer with Python's `BytesIO` and use it the same way (joblib's API is file-based, but it accepts file-like objects).
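A sketch of the `BytesIO` variant (again with placeholder toy data):

```python
import io
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0]])
scaler = StandardScaler().fit(X)

# joblib.dump accepts file-like objects, so an in-memory buffer works
buffer = io.BytesIO()
joblib.dump(scaler, buffer)
blob = buffer.getvalue()  # bytes, ready for DB storage

# restore by wrapping the bytes in a fresh buffer
restored = joblib.load(io.BytesIO(blob))
print(np.allclose(scaler.transform(X), restored.transform(X)))  # True
```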

Robin
sascha