
After running k-means I can easily get an array with the assigned cluster for every data point. Now I want to get a membership matrix (one-hot array) that has the clusters as columns and indicates each data point's cluster assignment with a 1 or 0.

My code is shown below and it works, but I am wondering if there is a more elegant way to do the same.

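import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3).fit(data)
# one 0/1 column per cluster, stacked side by side
membership_matrix = np.stack([np.where(km.labels_ == 0, 1, 0),
                              np.where(km.labels_ == 1, 1, 0),
                              np.where(km.labels_ == 2, 1, 0)],
                             axis=1)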
Daniel F
Oli4

3 Answers


Here's a method that's agnostic to the number of clusters you have (with your method, you'll have to "stack" more things if you have more clusters).

This code sample assumes you have six data points and 3 clusters:

import numpy as np

NUM_DATA_POINTS = 6
NUM_CLUSTERS = 3
clusters = np.array([2,1,2,2,0,1]) # hard-coded as an example, but this is your KMeans output

# create your empty membership matrix
membership = np.zeros((NUM_DATA_POINTS, NUM_CLUSTERS))
# for each row i, set the column given by clusters[i] to 1
membership[np.arange(NUM_DATA_POINTS), clusters] = 1

The key feature being used here is integer array indexing on a 2D array: in the last line of code above, we index the rows of membership sequentially (np.arange creates an incrementing sequence from 0 to NUM_DATA_POINTS-1) and the columns of membership using the cluster assignments, so row i gets a 1 in column clusters[i]. Here's the relevant numpy reference.

It would produce the following membership matrix:

>>> membership
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])   
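
The same indexing idea can be wrapped in a small helper so the number of data points does not have to be hard-coded. This is only a sketch: the function name membership_from_labels is made up for illustration, and it assumes the labels are integers in the range 0 to n_clusters-1. It also uses an integer dtype so the matrix holds 0/1 rather than floats.

```python
import numpy as np

def membership_from_labels(labels, n_clusters):
    """Return an (n_points, n_clusters) 0/1 matrix from integer cluster labels."""
    labels = np.asarray(labels)
    n_points = labels.shape[0]
    membership = np.zeros((n_points, n_clusters), dtype=int)
    # Row i gets a 1 in column labels[i].
    membership[np.arange(n_points), labels] = 1
    return membership

clusters = np.array([2, 1, 2, 2, 0, 1])
membership = membership_from_labels(clusters, n_clusters=3)
```

With a fitted estimator this would presumably be called as membership_from_labels(km.labels_, km.n_clusters). Every row should sum to 1, which makes for an easy sanity check.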
rbedi100

You can create a one-hot array, which is equivalent to your membership matrix, from the array of cluster labels, as described in this question. Here is how to do it using np.eye:

import numpy as np

clusters = np.array([2,1,2,2,0,1])
n_clusters = max(clusters) + 1
membership_matrix = np.eye(n_clusters)[clusters]

The output is as follows:

array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
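
One caveat worth sketching: inferring the number of columns from the largest label gives too few columns if the highest-numbered cluster happens not to appear in the label array (for example, when only a subset of the points is used). The example below assumes the number of clusters is known separately, and the label values are invented to show the difference. Passing dtype=int to np.eye also yields integer 0/1 entries.

```python
import numpy as np

clusters = np.array([0, 1, 0, 1])  # illustrative labels; cluster 2 is empty here
n_clusters = 3                     # assumed known, e.g. the value passed to KMeans

# Width inferred from the data: only two columns.
inferred = np.eye(clusters.max() + 1, dtype=int)[clusters]

# Width given explicitly: three columns, the last one all zeros.
explicit = np.eye(n_clusters, dtype=int)[clusters]
```

Indexing the identity matrix with the label array simply picks row clusters[i] of the identity for each data point, which is why the result is one-hot.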
Daniel F
titipata

You are looking for LabelBinarizer. Give this code a try:

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
membership_matrix = lb.fit_transform(km.labels_)

In contrast to other solutions proposed here, this approach:

  • Generates a compact membership matrix when the labels are not consecutive numbers.
  • Is able to deal with categorical labels.

Sample run:

In [9]: lb.fit_transform([0, 1, 2, 0, 2, 2])
Out[9]: 
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])

In [10]: lb.fit_transform([0, 1, 9, 0, 9, 9])
Out[10]: 
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])

In [11]: lb.fit_transform(['first', 'second', 'third', 'first', 'third', 'third'])
Out[11]: 
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1]])
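
For readers who want the compact behaviour without scikit-learn, a rough NumPy-only equivalent can be sketched with np.unique. This is not how LabelBinarizer is implemented, just an approximation of the result; the helper name compact_one_hot is hypothetical. As far as I know it also differs in the two-class case, where LabelBinarizer returns a single column while this sketch always returns one column per distinct label.

```python
import numpy as np

def compact_one_hot(labels):
    """Map arbitrary labels to a one-hot matrix with one column per distinct label."""
    # classes: sorted distinct labels; inverse: index of each label within classes
    classes, inverse = np.unique(np.asarray(labels), return_inverse=True)
    membership = np.eye(len(classes), dtype=int)[inverse.reshape(-1)]
    return classes, membership

classes, membership = compact_one_hot([0, 1, 9, 0, 9, 9])
str_classes, str_membership = compact_one_hot(
    ['first', 'second', 'third', 'first', 'third', 'third'])
```

The returned classes array records which original label each column stands for, playing a role similar to the classes_ attribute of a fitted LabelBinarizer.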
Tonechas