
Background

I am currently using the kmodes Python package to perform unsupervised learning on data that includes categorical parameters.

I need to be able to save these models, as I am planning to use them in a production pipeline where I want to be able to "roll back" to older, working models if something in the pipeline fails.

Requirements

I can use any file format, including HDF5. I am also not wedded to kmodes; however, I do need to be able to handle mixed categorical and numerical data.


Help

I cannot seem to find any way that I can save the full kmodes model to disk, but I'm hoping that I'm just missing something obvious. Please provide any potential options.

Mike Williamson
  • Please provide the reason for downvote. Is the question unclear? There is no need for sample data, for instance. It seems both self sufficient and self evident. – Mike Williamson Mar 24 '18 at 02:30
  • Can you provide an example? In @chthonicdaemon example the data returned by KModes is a simple and highly correlated numpy array, which can be very efficiently saved in a compressed HDF5-Format. – max9111 Apr 18 '18 at 10:45
  • @MikeWilliamson I would appreciate some additional comments on why one of the answers isn't good enough... – chthonicdaemon Apr 19 '18 at 17:12
  • @chthonicdaemon Your answer was good enough. I got sidetracked with other work and didn't come back to this page for a few days. Thanks so much! Very helpful, in fact!! – Mike Williamson Apr 24 '18 at 01:27

3 Answers


Let's start with the example clustering from the project's README:

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

We can now save this using the pickle module:

import pickle

# It is important to use binary access
with open('km.pickle', 'wb') as f:
    pickle.dump(km, f)

To read back the object, use

with open('km.pickle', 'rb') as f:
    km = pickle.load(f)
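
Since the question mentions rolling back to older models, here is a minimal sketch of one way to keep several pickled versions side by side. The helper names (`list_versions`, `save_version`, `load_version`) and the `model-000001.pickle` naming scheme are hypothetical, not part of kmodes or pickle. The helpers accept any picklable object, so a fitted `km` should work the same way as the plain dict used in the usage note.

```python
import pickle
from pathlib import Path


def list_versions(directory):
    """Return the saved model files in `directory`, oldest first."""
    return sorted(Path(directory).glob("model-*.pickle"))


def save_version(model, directory):
    """Pickle `model` under the next free version number and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded numbers keep lexical order equal to version order.
    next_number = len(list_versions(directory)) + 1
    path = directory / f"model-{next_number:06d}.pickle"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


def load_version(directory, steps_back=0):
    """Load the newest model, or an older one when steps_back > 0."""
    versions = list_versions(directory)
    with open(versions[-1 - steps_back], "rb") as f:
        return pickle.load(f)
```

For example, after `save_version(km, "models")` has been called twice, `load_version("models", steps_back=1)` returns the earlier model. Two caveats: a pickle is generally only safe to load with the same library versions that produced it, and pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.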
chthonicdaemon

It appears that the KModes and KPrototypes classes inherit from scikit-learn's BaseEstimator. In scikit-learn, you can save and load a trained model via standard serialization, using pickle.

Here’s a link to sklearn docs on saving a model using pickle or the serialization code from joblib: http://scikit-learn.org/stable/modules/model_persistence.html

Does this answer address your problem? Are the kmodes models not serializable in your application?
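
One quick way to check whether a particular object is serializable is to round-trip it through pickle in memory. The helper below is a hypothetical sketch (the name `round_trips` is not from any library). It only tests that pickling and unpickling succeed, not that the restored model behaves identically.

```python
import pickle


def round_trips(obj):
    """Return True if `obj` can be pickled and unpickled in memory."""
    try:
        pickle.loads(pickle.dumps(obj))
    except (pickle.PicklingError, AttributeError, TypeError):
        # Typical failures: lambdas, local classes, open file handles.
        return False
    return True
```

Calling `round_trips(km)` on a fitted model before wiring it into a pipeline gives an early signal if some attribute cannot be pickled.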

svohara

You are looking for the Python pickle library.

The pickle module implements an algorithm for turning an arbitrary Python object into a series of bytes. This process is also called "serializing" the object. The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

I think this would be a very helpful resource for you in implementing it.

Another library to look into, if you are still on Python 2, is cPickle. Why?

First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. In Python 3 there is no separate cPickle module: the standard pickle module uses its C implementation automatically when it is available, so you get this speed-up without changing any imports.

Given that you need to save your models to disk, your model is probably fairly large, so serialization time matters and the C implementation can save a lot of it.

Second, in the Python 2 cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.

So it depends on your program and required functionality. A good example of using cPickle can be found here
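
A minimal Python 3 sketch, where the standard pickle module picks up its C implementation on its own and no cPickle import is needed. The `state` dict is a hypothetical stand-in for a fitted model. Choosing `pickle.HIGHEST_PROTOCOL` selects the newest binary protocol the running interpreter supports, which is usually the most compact and fastest option.

```python
import pickle

# Hypothetical stand-in for fitted model state; a real model object
# would be passed to pickle in exactly the same way.
state = {"n_clusters": 4, "centroids": [[3, 7, 1], [0, 2, 9]]}

# Serialize to bytes with the newest protocol this interpreter supports.
blob = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize; the protocol is detected from the byte stream itself.
restored = pickle.loads(blob)
```

The trade-off is that files written with a newer protocol may not be readable by older Python versions, which matters if models are loaded on a different interpreter than the one that saved them.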

cacti5
  • Thanks, @Anna! I also appreciate the `pickle` v. `cPickle` comparison. I chose @chthon's for the bounty, though, since it provided an example. – Mike Williamson Apr 26 '18 at 23:22