48

I am using the KMeans class from sklearn.cluster. Once the clustering is finished, how can I find out which values were grouped together?

Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?

Is there a function that takes a cluster id and lists all the data points in that cluster?

desertnaut
user77005

6 Answers

51

I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the cluster labels as columns.

import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('filename')

km = KMeans(n_clusters=5).fit(data)

cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_

Once the DataFrame is available, it is quite easy to filter. For example, to select all data points in cluster 3:

cluster_map[cluster_map.cluster == 3]
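
On the question of row order: `km.labels_` holds one label per row passed to `fit`, in the same order, so attaching the labels to the original index keeps the mapping intact even when the index is not 0..n-1. A minimal sketch, assuming a small made-up DataFrame (the column names, index labels and values are hypothetical, standing in for the CSV above):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data with a non-default index, standing in for pd.read_csv(...)
data = pd.DataFrame(
    {"x": [1.0, 1.1, 0.9, 8.0, 8.2, 7.9], "y": [1.0, 0.9, 1.2, 8.1, 7.9, 8.0]},
    index=["a", "b", "c", "d", "e", "f"],
)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# labels_[i] belongs to the i-th row passed to fit(), so reuse the original index
clusters = pd.Series(km.labels_, index=data.index, name="cluster")

# All rows that ended up in the same cluster as row "a"
same_as_a = data[clusters == clusters["a"]]
print(same_as_a.index.tolist())
```

Comparing against the label of a known row, rather than a fixed number, avoids depending on how KMeans happens to number the clusters in a given run.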
Praveen
  • 2
    there is no need to use pandas – seralouk Jun 11 '18 at 18:44
  • 3
    When learning new models, I seem to struggle with this last part of returning the modeled data back to the original source. Most tutorials do not show that. Thank you for your answer. – user76595 Oct 05 '18 at 21:14
  • @Praveen Are you sure that it is going to be indexed correctly? Does your solution preserve order of rows when reconstructing dataframe from `km.labels_` as it was before clustering? – PeterB Nov 04 '18 at 18:42
22

If you have a large dataset and you need to extract clusters on-demand you'll see some speed-up using numpy.where. Here is an example on the iris dataset:

from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np

# Load the iris measurements (150 samples, 4 features)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit KMeans with 3 clusters
km = KMeans(n_clusters=3)
km.fit(X)

Define a function that returns the indices of the samples in a given cluster. Two versions are shown here for benchmarking; both return the same values:

def ClusterIndicesNumpy(clustNum, labels_array): #numpy 
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array): #list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

Let's say you want all samples that are in cluster 2:

ClusterIndicesNumpy(2, km.labels_)
array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
       115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
       134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])

Numpy wins the benchmark:

%timeit ClusterIndicesNumpy(2,km.labels_)

100000 loops, best of 3: 4 µs per loop

%timeit ClusterIndicesComp(2,km.labels_)

1000 loops, best of 3: 479 µs per loop

Now you can extract all of your cluster 2 data points like so:

X[ClusterIndicesNumpy(2,km.labels_)]

array([[ 6.9,  3.1,  4.9,  1.5], 
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6.3,  3.3,  6. ,  2.5], 
       ... #truncated

Double-check the first three indices from the truncated array above:

print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])

[ 6.9  3.1  4.9  1.5] 2
[ 6.7  3.   5.   1.7] 2
[ 6.3  3.3  6.   2.5] 2
Kevin
9

Actually, a very simple way to do this is:

clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]

The second line returns all the rows of df that belong to cluster 0. You can find the elements of the other clusters the same way.
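
If the row identifiers are needed rather than the rows themselves, the same boolean mask can be applied to the index. A minimal sketch, assuming a small made-up `df` (the column names, index values and data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data; the index deliberately does not start at 0
df = pd.DataFrame(
    {"a": [0.0, 0.2, 10.0, 10.1], "b": [0.0, 0.1, 10.0, 9.9]},
    index=[10, 11, 12, 13],
)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df)

mask = clusters.labels_ == 0       # boolean array, one entry per row of df
index_labels = df.index[mask]      # index labels of the rows in cluster 0
positions = np.flatnonzero(mask)   # integer positions of the same rows
```

`index_labels` is what `df.loc` expects, while `positions` is what `df.iloc` expects.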

gonidelis
  • This is elegant, but I wonder if there is a way to retrieve the indexes of the elements in `df` that has label 0 in this case. – galactica Jan 13 '23 at 02:14
5

To get the IDs of the points/samples/observations that are inside each cluster, do this:

Python 2

Example using Iris data and a nice pythonic way:

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(0)

# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # KMeans is unsupervised, so y is not needed for fitting

#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_

# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}

# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key,value]
    dictlist.append(temp)

RESULTS

#dict format
{0: array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
            64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
            78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
            91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
           119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
 1: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
           17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
           34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
 2: array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
           115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
           134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}

# list format
[[0, array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
             64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
             78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
             91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
             119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
 [1, array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
 [2, array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
             115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
             134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]

Python 3

Just change

for key, value in mydict.iteritems():

to

for key, value in mydict.items():
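
In Python 3, the dictionary-to-list step can also be written as a list comprehension. A minimal sketch, assuming a made-up label array in place of `clf.labels_` (the values and `n_clusters` here are hypothetical) so that it runs without fitting a model:

```python
import numpy as np

# Hypothetical stand-in for clf.labels_ and clf.n_clusters
labels = np.array([0, 1, 1, 2, 0, 2, 1])
n_clusters = 3

# Indices of the points in each cluster, keyed by cluster id
mydict = {i: np.where(labels == i)[0] for i in range(n_clusters)}

# Same [key, value] list as the loop above, built in one expression
dictlist = [[key, value] for key, value in mydict.items()]
```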
seralouk
  • 1
    For those who are working with python3 and encountering a problem with this solution, you just need to change iteritems() to items() – Serdar Sayın Jun 17 '20 at 10:21
  • Indeed, my answer is in python2. I am going to update it now for python3 as well. Cheers – seralouk Jun 17 '20 at 10:50
3

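You can look at the labels_ attribute.

For example:

from sklearn.cluster import KMeans

km = KMeans(2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
print(km.labels_)
output: [1 1 0]

As you can see, the first and second points are in cluster 1, and the last point is in cluster 0. Which cluster is numbered 0 and which is numbered 1 can differ between runs.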

Farseer
  • Yes, this method would work, but when there are a lot of data points, iterating through all of them to get the labels is not efficient, right? I just want the list of data points for a given cluster. Isn't there another way to do this? – user77005 Mar 25 '16 at 00:27
  • @user77005 see the answer that I just posted – seralouk Jun 11 '18 at 18:44
0

You can simply store the labels in an array and convert the array to a DataFrame. Then merge the data that you used to fit KMeans with the new DataFrame of cluster labels.

Display the DataFrame. You should now see each row with its corresponding cluster. If you want to list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming cluster 2 is the one you want.
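
The steps above can be sketched as follows. This is only an illustration under assumptions: the feature columns `f1` and `f2` and their values are made up, and `cluster_label_name` is just the placeholder column name used above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data; replace with the DataFrame you actually clustered
data = pd.DataFrame(
    {"f1": [1.0, 1.2, 0.8, 9.0, 9.1, 8.8], "f2": [2.0, 2.1, 1.9, 7.0, 7.2, 6.9]}
)
features = ["f1", "f2"]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data[features])

# Put the labels in their own DataFrame, sharing the index of the original data
labels = pd.DataFrame({"cluster_label_name": km.labels_}, index=data.index)

# Merge on the index so every row carries its cluster label
data = data.join(labels)

# List all rows in one specific cluster (here: the cluster of the first row)
first_cluster = data["cluster_label_name"].iloc[0]
subset = data.loc[data["cluster_label_name"] == first_cluster]
print(subset)
```

Joining on the index, rather than concatenating by position, keeps labels and rows matched even if the DataFrame has been filtered or reordered before clustering.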