Hypertools seems like a great toolset for quickly analyzing high-dimensional data.

In particular, you can take some data, throw it in hypertools.plot(...) and get something nice looking out.

However, I'm having trouble reproducing the groups afterward.

In theory, hypertools.plot(data, reduce="alg", cluster="alg2") should be roughly equivalent to:

data = np.array(...)
reduced = hypertools.analyze(data, reduce="alg")
labels = hypertools.cluster(reduced, cluster="alg2")
hypertools.plot(reduced, hue=labels)

But I'm seeing wildly different labels from the step-by-step approach compared to hypertools.plot(...).

Is there a way to get the same clusters out without plotting? Alternatively, can I extract the clusters from the return value of hypertools.plot(...)? I'd prefer not to, since my Python session sometimes doesn't notice that the plot window has been closed, so the call never returns.

Daniel Mahler
Logan

1 Answer

Looking at the hypertools.plot source code, the issue appears to be the order of operations. When reduce and cluster are passed in the same call to plot, the data is first reduced to three dimensions (or to ndims, if a value smaller than three is given) and only then clustered. In the step-by-step approach, analyze is called without ndims, so clustering runs before the data is cut down to three dimensions, which happens only at plot time. Passing ndims=3 to analyze in the step-by-step approach produces the same results as the one-liner.

So the answer to your question 'Is there a way to get the same clusters out without plotting?' would be to pass ndims=3 to the analyze function.

From plot.py (hypertools v0.6.2):

    # reduce data to 3 dims for plotting, if ndims is None, return this
    if (ndims and ndims < 3):
        xform = reducer(xform, ndims=ndims, reduce=reduce, internal=True)
    else:
        xform = reducer(xform, ndims=3, reduce=reduce, internal=True)

    # find cluster and reshape if n_clusters
    if cluster is not None:
        if hue is not None:
            warnings.warn('cluster overrides hue, ignoring hue.')
        if isinstance(cluster, (six.string_types, six.binary_type)):
            model = cluster
            params = default_params(model)
        elif isinstance(cluster, dict):
            model = cluster['model']
            params = default_params(model, cluster['params'])
        else:
            raise ValueError('Invalid cluster model specified; should be'
                             ' string or dictionary!')

        if n_clusters is not None:
            if cluster in ('HDBSCAN',):
                warnings.warn('n_clusters is not a valid parameter for '
                              'HDBSCAN clustering and will be ignored.')
            else:
                params['n_clusters'] = n_clusters

        cluster_labels = clusterer(xform, cluster={'model': model,
                                               'params': params})
        xform, labels = reshape_data(xform, cluster_labels, labels)
        hue = cluster_labels
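
The dimensionality rule in the excerpt above can be mirrored in a small helper, so that the step-by-step path clusters on the same number of dimensions plot would use. This is only a sketch: plot_ndims and cluster_like_plot are hypothetical names, not part of hypertools, and the only library calls are the analyze and cluster calls already used in the question. It also assumes the default clustering parameters match between the two paths, as they did in the Birch example; that has not been checked for other models.

```python
def plot_ndims(ndims=None):
    """Dimensions plot() reduces to before clustering, per the excerpt above."""
    # Mirrors the branch: if (ndims and ndims < 3) use ndims, otherwise use 3.
    return ndims if (ndims and ndims < 3) else 3


def cluster_like_plot(data, reduce, cluster, ndims=None):
    """Hypothetical helper: reduce, then cluster, without drawing anything."""
    import hypertools  # imported lazily so plot_ndims works without it installed

    reduced = hypertools.analyze(data, ndims=plot_ndims(ndims), reduce=reduce)
    labels = hypertools.cluster(reduced, cluster=cluster)
    return reduced, labels
```

Calling cluster_like_plot(data, "SparsePCA", "Birch") would then stand in for the one-liner when only the labels are needed.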

Example using the mushroom sample data set:

import hypertools
# IPython/Jupyter magic; leave this line out when running as a plain script
%matplotlib inline

geo = hypertools.load('mushrooms')
data = geo.get_data()
reduced = hypertools.analyze(data, ndims=3, reduce="SparsePCA")
labels = hypertools.cluster(reduced, cluster="Birch")
hypertools.plot(reduced, '.', hue=labels)

Gives the same results as:

hypertools.plot(data, '.', reduce="SparsePCA", cluster="Birch")

The step-by-step approach without ndims=3 passed to analyze, by comparison, produces different clusters.
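
When comparing the labels from the two approaches, keep in mind that cluster label numbers are arbitrary: two runs can describe the same grouping with different integers. A minimal sketch of a check that ignores the label names follows. same_partition is a hypothetical helper written in plain Python, not a hypertools function.

```python
def same_partition(labels_a, labels_b):
    """True if both label sequences group the points identically,
    regardless of which integer each cluster was given."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must map to exactly one label on the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True
```

For example, same_partition([0, 0, 1, 2], [2, 2, 0, 1]) is True because only the names differ, while same_partition([0, 0, 1, 1], [0, 1, 0, 1]) is False.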

cwalvoort