Cluster using different colours and labels

Question

I am working on text clustering and need to plot the data using different colours. I used k-means for clustering and tf-idf to vectorize the text.

kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:,0], data2D[:,1])

kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])

Currently, my output is a plain scatter plot with only a few points, since this is a test. I need to add labels (they are strings) and to distinguish the dots by cluster: each cluster should have its own colour so that the chart is easy for the reader to analyse.

Could you please tell me how to change my code to include both labels and colours? Any example would be great.

A sample of my dataset is (the output above was generated from a different sample):

Sentences

Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.

Here I see a perfect case for using `plotly`. Do you mind to provide a [mcve](/help/mcve)? At least your original df with a column for cluster. — rpanai, May 23 '20 at 00:50
does this help [adding colors and labels](https://stackoverflow.com/questions/47006268/matplotlib-scatter-plot-with-color-label-and-legend-specified-by-c-option) — Carlos Azevedo, May 23 '20 at 01:00

score 2 · Accepted Answer · answered May 23 '20 at 13:30

We can use an example dataset:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA  # needed for the PCA projection below
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target

pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
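# toarray() gives a plain ndarray; recent scikit-learn versions reject the np.matrix returned by todense()
X = pipeline.fit_transform(X_train).toarray()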

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

Then run KMeans as you did to obtain the cluster labels and centers. The only addition is a name for each cluster:

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
# range() keeps the names in the same order as the rows of cluster_centers_
cluster_name = ["Cluster" + str(i) for i in range(kmeans.n_clusters)]

You can add the colors by passing the cluster labels to "c=" and either choosing a built-in colormap (via "cmap=" or from cm) or defining your own map:

plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
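
The "define your own map" option and a per-cluster legend can be sketched as follows. This is only a minimal illustration: the random `data2D` and `labels` arrays are hypothetical stand-ins for the PCA output and KMeans labels above, and the three colours are an arbitrary choice. It relies on `ListedColormap` and on `legend_elements()`, which matplotlib provides on the object returned by `scatter`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Hypothetical stand-ins for data2D and labels from the snippet above.
rng = np.random.default_rng(0)
data2D = rng.normal(size=(60, 2))
labels = np.repeat([0, 1, 2], 20)

# A custom colormap with one colour per cluster (arbitrary colours).
own_map = ListedColormap(["red", "blue", "green"])

fig, ax = plt.subplots()
scatter = ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap=own_map, alpha=0.7)

# legend_elements() returns one handle per distinct value passed to c=.
handles, _ = scatter.legend_elements()
cluster_name = ["Cluster" + str(i) for i in np.unique(labels)]
legend = ax.legend(handles, cluster_name, title="Clusters")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.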

You can also consider using seaborn:

# x and y are passed as keywords; newer seaborn releases no longer accept them positionally
sns.scatterplot(x=data2D[:,0], y=data2D[:,1], hue=labels, legend='full', palette="Set1")

Carlos Azevedo · Answer 2 · 2020-05-23T01:42:08.820

Picking up on your code, try the following:

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
# toarray() gives a plain ndarray; recent scikit-learn versions reject the np.matrix returned by todense()
X = pipeline.fit_transform(X_train['Sentences']).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

# Fit KMeans once on X and keep the fitted object, so its centers and labels are available
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_

cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}

fig, ax = plt.subplots()
for g in np.unique(group):
    ix = np.where(group == g)
    ax.scatter(data2D[:,0][ix], data2D[:,1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()

I'm assuming your kmeans has n_clusters=3. The cdict and ldict dictionaries need one entry per cluster, so adjust them to match the number of clusters. In this case, cluster 0 will be red with label label_1, cluster 1 will be blue with label label_2, and so on.
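
If the number of clusters may change, cdict and ldict could be built from the labels instead of being typed by hand. The sketch below is one possible way to do that, not part of the original answer: the `group` array is a hypothetical example, and the "tab10" colormap and the `label_N` naming are assumptions chosen to mirror the dictionaries above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical cluster labels, standing in for kmeans.labels_.
group = np.array([0, 2, 1, 1, 0, 3])
clusters = np.unique(group)

# Take one colour per cluster from a qualitative colormap (wraps around after cmap.N colours).
cmap = plt.get_cmap("tab10")
cdict = {int(g): cmap(i % cmap.N) for i, g in enumerate(clusters)}

# Same naming scheme as above: cluster 0 -> label_1, cluster 1 -> label_2, ...
ldict = {int(g): f"label_{int(g) + 1}" for g in clusters}
```

The values in this cdict are RGBA tuples, so in the plotting loop it is safer to pass them as `color=cdict[g]` rather than `c=cdict[g]`, which matplotlib can confuse with a sequence of values to colour-map.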

EDIT: I changed the keys of cdict to start from 0. EDIT 2: Added labels.
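
To also write a string label next to each dot, as the question asks, `ax.annotate` can be called once per point. The following is a minimal self-contained sketch under a few assumptions: it uses the sample sentences from the question as displayed there (they may be truncated), fixes `random_state` and `n_init` only to make the run repeatable, and shortens each label to 20 characters purely for readability.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences copied from the question (as displayed there).
sentences = [
    "Where do we do list them?",
    "Make me a list of the things we would need and I'll take you into town.",
    "Do you have a list yet?",
    "The first was a list for Howie.",
    "You're not on my list tonight.",
    "I'm gonna print this list on my computer, given you're always bellyaching about my writing.",
]

X = TfidfVectorizer().fit_transform(sentences).toarray()
data2D = PCA(n_components=2).fit_transform(X)
group = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

cdict = {0: "red", 1: "blue", 2: "green"}

fig, ax = plt.subplots()
for g in np.unique(group):
    mask = group == g  # boolean mask selecting the points of cluster g
    ax.scatter(data2D[mask, 0], data2D[mask, 1], c=cdict[g], label=f"cluster {g}", s=100)

# Put a shortened copy of each sentence slightly up and to the right of its point.
annotations = [
    ax.annotate(text[:20], (x, y), xytext=(5, 5), textcoords="offset points", fontsize=8)
    for (x, y), text in zip(data2D, sentences)
]
ax.legend()
```

With many points the text will overlap, so for larger datasets it may be better to annotate only a few points per cluster or the cluster centers.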
