The supplied model is not a clustering estimator in YellowBrick

Question

I am trying to visualize an elbow plot for my data using YellowBrick's KElbowVisualizer and SKLearn's Expectation Maximization algorithm class: GaussianMixture.

When I run this, I get the error in the title. (I have also tried ClassificationReport, but that fails as well)

model = GaussianMixture()

data = get_data(data_name, preprocessor_name, train_split=0.75)
X, y, x_test, y_test = data

visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

I cannot find anything in YellowBrick to help me estimate the number of components for expectation maximization.

score 13 · Accepted Answer · answered Nov 09 '19 at 13:28

Yellowbrick uses the sklearn estimator type checks to determine if a model is well suited to the visualization. You can use the force_model param to bypasses the type checking (though it seems that the KElbow documentation needs to be updated with this).

However, even though force_model=True gets you through the YellowbrickTypeError it still does not mean that GaussianMixture works with KElbow. This is because the elbow visualizer is set up to work with the centroidal clustering API and requires both a n_clusters hyperparam and a labels_ learned param. Expectation maximization models do not support this API.

However, it is possible to create a wrapper around the Gaussian mixture model that will allow it to work with the elbow visualizer (and a similar method could be used with the classification report as well).

from sklearn.base import ClusterMixin
from sklearn.mixture import GaussianMixture
from yellowbrick.cluster import KElbow
from yellowbrick.datasets import load_nfl

class GMClusters(GaussianMixture, ClusterMixin):

    def __init__(self, n_clusters=1, **kwargs):
        kwargs["n_components"] = n_clusters
        super(GMClusters, self).__init__(**kwargs)

    def fit(self, X):
        super(GMClusters, self).fit(X)
        self.labels_ = self.predict(X)
        return self 


X, _ = load_nfl()
oz = KElbow(GMClusters(), k=(4,12), force_model=True)
oz.fit(X)
oz.show()

This does produce a KElbow plot (though not a great one for this particular dataset):

Another answer mentioned Calinksi Harabasz scores, which you can use in the KElbow visualizer as follows:

oz = KElbow(GMClusters(), k=(4,12), metric='calinski_harabasz', force_model=True)
oz.fit(X)
oz.show()

Creating the wrapper isn't ideal, but for model types that don't fit the standard classifier or clusterer sklearn APIs, they are often necessary and it's a good strategy to have in your back pocket for a number of ML tasks.

This is a great answer, thank you! It should really be in the YB docs — James L., Nov 11 '19 at 17:48
This is great! I don't know if the API changed since this was written, but I got good results by skipping the `__init__()` override, and adding kwargs key overrides to `set_params()` and `get_params()`...with that, I was able to use all of the scoring metrics, including distortion. Added as a separate answer, but this could just be edited I guess... — Chris Vandevelde, Mar 31 '21 at 05:24

score 3 · Answer 2 · answered Mar 31 '21 at 05:28

Buiding on @bbengfort's great answer, I used:

class GaussianMixtureCluster(GaussianMixture, ClusterMixin):
    """Subclass of GaussianMixture to make it a ClusterMixin."""

    def fit(self, X):
        super().fit(X)
        self.labels_ = self.predict(X)
        return self

    def get_params(self, **kwargs):
        output = super().get_params(**kwargs)
        output["n_clusters"] = output.get("n_components", None)
        return output

    def set_params(self, **kwargs):
        kwargs["n_components"] = kwargs.pop("n_clusters", None)
        return super().set_params(**kwargs)

This lets you use any scoring metric, and works with the latest version of YellowBrick.

Michael Bridges · Answer 3 · 2019-11-04T03:50:30.073

2

You can use the sklearn calinski_harabasz_score- see the relevant docs here.

scores = pd.DataFrame()
components = 100
for n in range(2,components):
    model = GaussianMixture(n_components=n)
    y = model.fit_predict(X)
    scores.loc[n,'score'] = calinski_harabasz_score(X,y)
plt.plot(scores.reset_index()['index'],scores['score'])

Something like this should give similar functionality.

edited Nov 04 '19 at 03:50

answered Oct 31 '19 at 21:33

Michael Bridges

351
2
6

The supplied model is not a clustering estimator in YellowBrick

3 Answers3