1

I'm trying to classify cells into populations. When I use:

gmix = mixture.GMM(n_components=3, covariance_type='full')
gmix.fit(samples)

The order of the means printed by the code below changes between runs unless I set np.random.seed(0).

print ("gmix.means \n", gmix.means_)
colors = ['r' if i==0 else ('g' if i==1 else ('b' if i ==2 else 'm')) for i in gmix.predict(samples)]

I would like the classes sorted by the X-axis mean (the first item of each class), i.e.:

[[  3.25492404e+02   2.88403293e-02]  
[  3.73942908e+02   3.25283512e-02] 
[  5.92577646e+02   4.40595768e-02]]

So in the code above red would always be 325, green 373 and blue 592. At the moment I don't think anything is sorting the output.

I tried:

gmix.means_ = np.sort(gmix.means_, axis = 0)

But then the gmix.covars_ and gmix.weights_ also need to be sorted accordingly, which is where I'm stuck!

Many thanks!

Edit 4/5/16:

Thanks for the help and steering me in the right direction. Here is my poorly written but working version:

    sort_indices = gmix.means_.argsort(axis = 0)
    order = sort_indices[:, 0]
    print('\norder:', order)
    gmix.means_ = gmix.means_[order,:]    

    gmix.covars_ = gmix.covars_[order, :]
    print ("\n sorted gmix.covars \n", gmix.covars_) 

    print ("\n\nori gmix.weights \n", gmix.weights_)
    w = np.split(gmix.weights_,3)
    w = np.asarray(w)
    w = np.ravel(w[order,:])
    gmix.weights_ = w
Edward Burgin

3 Answers

1

I was looking for the same feature. Here is my solution, based on @ed3203's code:

def fit_predict_by(clf, X, order_function):
    """
    Sort `clf.fit_predict` by given attribute.

    It ensures that every call to `fit_predict` returns labels numbered
    according to the given ordering. In addition, the `clf` attributes
    `means_`, `covars_`, and `weights_` are reordered the same way.

    ## Usage

        # Sort by cluster weights
        y = fit_predict_by(clf, X, lambda clf: clf.weights_.argsort())
        # or sort by the `x` value of the mean
        y = fit_predict_by(clf, X, lambda clf: clf.means_[:, 0].argsort())
    """
    y = clf.fit_predict(X)
    order = order_function(clf)

    for attr in ('means_', 'covars_', 'weights_'):
        sorted_attr = getattr(clf, attr)[order]
        setattr(clf, attr, sorted_attr)

    ensure_no_overlap = len(order)
    for new_val, old_val in enumerate(order):
        y[y == old_val] = new_val + ensure_no_overlap
    return y - ensure_no_overlap
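
The `ensure_no_overlap` offset matters because the labels are rewritten in place: without it, a label that has already been renamed can be renamed a second time. A small sketch with made-up labels and a made-up `order` (neither comes from a fitted model) shows the difference:

```python
import numpy as np

# Hypothetical values for illustration: order[new_label] = old_label.
order = np.array([2, 0, 1])
y = np.array([0, 1, 2, 2, 0])

# Naive in-place renaming: later steps overwrite labels renamed earlier.
naive = y.copy()
for new_val, old_val in enumerate(order):
    naive[naive == old_val] = new_val

# Offset renaming, as in fit_predict_by: new labels are parked outside
# the range of the old ones, then shifted back at the end.
offset = len(order)
safe = y.copy()
for new_val, old_val in enumerate(order):
    safe[safe == old_val] = new_val + offset
safe -= offset

print(naive)
print(safe)
```

In the naive version every label ends up with the same value, while the offset version maps old label 2 to 0, old 0 to 1 and old 1 to 2.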
Nagasaki45
0

This is basically a matrix/vector indexing problem. I'm probably being too verbose here, but it should be just two lines to sort your matrices.

Clustering algorithms in general (GMM in your case) are not guaranteed to label the clusters in the same order every time, nor are they guaranteed to give you the same clusters every time, unless you fix the initial conditions.

If you want the clusters sorted by the X-coordinate of their means, you will probably need to do this yourself. This involves two steps, just like you mentioned in your question:

a) Sort the means and get the indices
b) Use the indices to reorder your means

This can be done simply as follows:

a) Do an argsort on your means

>>> import numpy as np
>>> means = np.array([[1, 2], [4, 3], [2, 6]])
>>> sort_indices = means.argsort(axis=0)
>>> sort_indices
array([[0, 0],
       [2, 1],
       [1, 2]])

Your order would be the first column of the argsorted array:

>>> order = sort_indices[:,0]
>>> order
array([0, 2, 1])

b) Now use this 'order' to reorder your means.

>>> sorted_m = means[order,:]
>>> sorted_m
array([[1, 2],
       [2, 6],
       [4, 3]])

Now for your covariances. Let us create a dummy covariance matrix:

>>> c = np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])
>>> c
array([[9, 8, 7],
       [6, 5, 4],
       [3, 2, 1]])

Now reindex c. An easy way is to apply the order to both the rows and the columns:

>>> sorted_c = c[order,:][:, order]
>>> sorted_c
array([[9, 7, 8],
       [3, 1, 2],
       [6, 4, 5]])

As you can see, the rows and columns are rearranged according to our new order.

There you have it, both your means and covariances sorted.
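
One caveat for an actual fitted model, assuming `covariance_type='full'`: the covariances are then stored as one matrix per component, stacked along the first axis, so only that axis needs reordering (as the `gmix.covars_[order, :]` line in the question's edit does). A sketch with made-up 2x2 matrices and weights:

```python
import numpy as np

means = np.array([[1, 2], [4, 3], [2, 6]])
# Made-up stack of per-component covariances, shape (3, 2, 2).
covars = np.array([np.eye(2) * k for k in (10, 20, 30)])
# Made-up mixture weights, one per component.
weights = np.array([0.5, 0.2, 0.3])

# Component order by the first coordinate of the means.
order = means[:, 0].argsort()

sorted_means = means[order]
sorted_covars = covars[order]    # reorder components, keep each matrix intact
sorted_weights = weights[order]
```

Each component keeps its own covariance matrix and weight; only its position in the arrays changes.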

You may need to relabel your original labels as well, for which you can use the answer here: Fast replacement of values in a numpy array
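
A minimal sketch of that relabelling step, assuming `order[new_label] = old_label` and made-up labels: the inverse permutation `np.argsort(order)` serves as a lookup table from old label to new label.

```python
import numpy as np

# Hypothetical values for illustration.
order = np.array([2, 0, 1])          # order[new_label] = old_label
labels = np.array([0, 1, 1, 2, 0])   # predictions from the unsorted model

# Inverting the permutation gives lookup[old_label] = new_label.
lookup = np.argsort(order)
new_labels = lookup[labels]
```

This avoids any in-place overwriting, since every label is translated in a single indexing step.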

user1669710
  • Thanks for the help and steering me in the right direction. Here is my poorly written but working version: `sort_indices = gmix.means_.argsort(axis = 0); order = sort_indices[:, 0]; print('\norder:', order); gmix.means_ = gmix.means_[order,:]; gmix.covars_ = gmix.covars_[order, :]; print ("\n sorted gmix.covars \n", gmix.covars_); print ("\n\nori gmix.weights \n", gmix.weights_); w = np.split(gmix.weights_,3); w = np.asarray(w); w = np.ravel(w[order,:]); gmix.weights_ = w` – Edward Burgin May 04 '16 at 15:16
0

As of scikit-learn 0.23.1, the right way is to reorder precisions_ and precisions_cholesky_ too. Also, covars_ is now covariances_. So for the 1D version you should do the following:

order = best_gmm.means_.argsort(axis=0)[:, 0]
best_gmm.means_ = best_gmm.means_[order]
best_gmm.covariances_ = best_gmm.covariances_[order]
best_gmm.weights_ = best_gmm.weights_[order]
best_gmm.precisions_ = best_gmm.precisions_[order]
best_gmm.precisions_cholesky_ = best_gmm.precisions_cholesky_[order]
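
Putting this together, here is a rough end-to-end sketch for a 2D case, assuming a recent scikit-learn, where `GaussianMixture` has replaced the `GMM` class used in the question. The synthetic blobs, the seed values and the helper name `sort_components_by_x` are illustrative assumptions, not part of the answers above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def sort_components_by_x(gmm):
    """Reorder a fitted GaussianMixture in place by the first coordinate of its means.

    Hypothetical helper; assumes a per-component covariance layout
    ('full', 'diag' or 'spherical'), not 'tied'.
    """
    order = gmm.means_[:, 0].argsort()
    for attr in ("means_", "covariances_", "weights_",
                 "precisions_", "precisions_cholesky_"):
        setattr(gmm, attr, getattr(gmm, attr)[order])
    return order


# Three well-separated synthetic blobs along the x axis (made-up data).
rng = np.random.default_rng(0)
centers = np.array([[20.0, 0.0], [0.0, 0.0], [10.0, 0.0]])
samples = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=3, random_state=0)
gmm.fit(samples)
sort_components_by_x(gmm)

# After sorting, label 0 is the leftmost cluster and label 2 the rightmost.
labels = gmm.predict(np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]]))
print(labels)
```

Because `predict` reads the reordered attributes, the labels follow the sorted order without a separate relabelling step. With `covariance_type='tied'` there is a single shared covariance matrix, so only `means_` and `weights_` would need reordering.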