@Nael Alsaleh, you can run K-Means the following way:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
X = np.load('Mistery.npy')

wx = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wx.append(kmeans.inertia_)   # within-cluster sum of squares for this k

plt.plot(range(1, 11), wx)
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.show()

Note that X is a NumPy array. This code produces the elbow curve, from which you can choose a suitable number of clusters; in this case, 5-6.
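If you prefer a numeric check instead of eyeballing the curve, you can print how much the inertia drops at each step (a small sketch reusing the wx list from the code above):

# Percentage drop in inertia when going from k-1 to k clusters;
# once the drop levels off, adding more clusters buys you little.
for k, (prev, curr) in enumerate(zip(wx[:-1], wx[1:]), start=2):
    print(f"k={k}: inertia drops by {100 * (prev - curr) / prev:.1f}%")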
If you are working with NumPy, you will already have an array:

array([0.86992608, 0.11252552, 0.25573737, ..., 0.32652233, 0.14927118,
       0.1662449 ])

You may instead be working with a list,

[0.86992608, 0.11252552, 0.25573737, ..., 0.32652233, 0.14927118,
 0.1662449 ]

which you will need to convert to an array with np.array(X), or with a pandas DataFrame, which you can also convert to an array.
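For example, quick sketches of both conversions (my_list and df below are hypothetical stand-ins for your data):

import numpy as np
import pandas as pd

my_list = [0.86992608, 0.11252552, 0.25573737]   # hypothetical list of values
X_from_list = np.array(my_list)                  # list -> NumPy array

df = pd.DataFrame({'feature': my_list})          # hypothetical DataFrame
X_from_df = df.to_numpy()                        # DataFrame -> NumPy array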

You can check column types in a pandas DataFrame by doing:

import pandas as pd
pd.DataFrame(X).dtypes

and in NumPy with X.dtype.
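If a column comes back as object (for example, numbers stored as strings), KMeans will not accept it, so cast it to a numeric type first; a minimal sketch with a hypothetical column:

import pandas as pd

df = pd.DataFrame({'a': ['0.87', '0.11', '0.26']})  # hypothetical column read in as strings
print(df.dtypes)                                     # a    object
df['a'] = df['a'].astype(float)                      # cast to float so KMeans can use it
print(df.dtypes)                                     # a    float64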
After converting the data to an array, run:

n = 5
kmeans = KMeans(n_clusters=n, random_state=20)
labels_of_clusters = kmeans.fit_predict(X)   # fit the model and return one cluster label per example

This gives you the cluster that each example belongs to:
array([1, 4, 0, 0, 4, 1, 4, 0, 2, 0, 0, 4, 3, 1, 4, 2, 2, 3, 0, 1, 1, 0,
4, 4, 2, 0, 3, 0, 3, 1, 1, 2, 1, 0, 2, 4, 0, 3, 2, 1, 1, 2, 2, 2,
2, 0, 0, 4, 1, 3, 1, 0, 1, 4, 1, 0, 0, 0, 2, 0, 1, 2, 2, 1, 2, 2,
0, 4, 4, 4, 4, 3, 1, 2, 1, 2, 2, 1, 1, 3, 4, 3, 3, 1, 0, 1, 2, 2,
1, 2, 3, 1, 3, 3, 4, 2, 2, 0, 2, 1, 3, 4, 2, 0, 2, 1, 3, 3, 3, 4,
3, 1, 4, 4, 4, 2, 0, 3, 2, 0, 1, 2, 2, 0, 3, 1, 1, 1, 4, 0, 2, 2,
0, 0, 1, 1, 0, 3, 0, 2, 2, 1, 2, 2, 4, 0, 1, 0, 3, 1, 4, 4, 0, 4,
1, 2, 0, 2, 4, 0, 1, 2, 3, 1, 1, 0, 3, 2, 4, 0, 1, 3, 1, 2, 4, 3,
1, 1, 2, 0, 0, 2, 3, 1, 3, 4, 1, 2, 2, 0, 2, 1, 4, 3, 1, 0, 3, 2,
4, 1, 4, 1, 4, 4, 0, 4, 4, 3, 1, 3, 4, 0, 4, 2, 1, 1, 3, 4, 0, 4,
4, 4, 4, 2, 4, 2, 3, 4, 3, 3, 1, 1, 4, 2, 3, 0, 2, 4])
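From there you can inspect the clustering, for example by counting how many examples landed in each cluster and looking at the fitted centroids (reusing the kmeans object and labels_of_clusters from above):

import numpy as np

print(np.bincount(labels_of_clusters))   # number of examples in each of the n clusters
print(kmeans.cluster_centers_)           # coordinates of each cluster centre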
To visualize:
from sklearn.datasets import make_blobs

# toy data: 200 points drawn from 4 blobs
X, y_true = make_blobs(n_samples=200, centers=4,
                       cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=0)
cc = kmeans.fit_predict(X)                                 # one label per point
plt.scatter(X[:, 0], X[:, 1], c=cc, s=50, cmap='viridis')
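You can also overlay the fitted centroids on the same scatter plot (a small addition to the snippet above):

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')  # mark each centroid
plt.show()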
