I am working on text clustering and need to plot the data using different colours. I used the k-means method for clustering and tf-idf for similarity.

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:,0], data2D[:,1])

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_

Currently, my output looks like this: [image]. There are only a few points because it is a test. I need to add labels (they are strings) and differentiate the dots by cluster: each cluster should have its own colour so that the chart is easy for the reader to analyse.

Could you please tell me how to change my code to include both labels and colours? Any example would be great.

A sample of my dataset is (the output above was generated from a different sample):

Sentences

Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
StupidWolf
still_learning

2 Answers

We can use an example dataset:

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target

# TF-IDF features, converted from a sparse matrix to a dense array before PCA
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).toarray()

# Project to 2 dimensions for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

Then run KMeans as you did to obtain the cluster labels and centers, and add a name for each cluster:

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
# One name per cluster id, in the same order as the rows of centers2D
cluster_name = ["Cluster" + str(i) for i in range(kmeans.n_clusters)]

You can add the colors by passing the cluster labels to "c=" and choosing a colormap from cm or defining your own map:

plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
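
The "define your own map" option can be sketched with matplotlib's ListedColormap. This is only an illustration: the data below is synthetic and stands in for data2D, labels and centers2D, and the three colours are an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Synthetic stand-ins for data2D, labels and centers2D (assumption, not the real data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
data2D = rng.normal(size=(60, 2)) + labels[:, None] * 3.0
centers2D = np.array([data2D[labels == k].mean(axis=0) for k in range(3)])

# One colour per cluster id; the colour names are arbitrary.
own_map = ListedColormap(["tab:red", "tab:blue", "tab:green"])

fig, ax = plt.subplots()
points = ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap=own_map, alpha=0.7)

# Write the cluster name at each (projected) cluster center.
for k, (cx, cy) in enumerate(centers2D):
    ax.text(cx, cy, f"Cluster{k}", ha="center", va="center")
```

With the real data, replace the synthetic arrays with the ones computed earlier and call plt.show() at the end.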

You can also consider using seaborn:

sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=labels, legend='full', palette="Set1")
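
If the string labels from the question should appear next to the individual points, one possible approach is to call ax.annotate once per point. The sketch below uses plain matplotlib. The coordinates and cluster ids are made up for illustration, the sentences are taken from the question's sample, and the 20-character truncation is an arbitrary choice to keep the plot readable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for data2D and labels; the real values come from PCA and KMeans.
data2D = np.array([[0.1, 0.2], [0.4, 0.1], [2.0, 2.1], [2.2, 1.9]])
labels = np.array([0, 0, 1, 1])
sentences = [
    "Do you have a list yet?",
    "The first was a list for Howie.",
    "You're not on my list tonight.",
    "Where do we do list them?",
]

fig, ax = plt.subplots()
ax.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap="Set1")

# Put a shortened copy of each sentence slightly up and to the right of its point.
for (x, y), text in zip(data2D, sentences):
    short = text if len(text) <= 20 else text[:17] + "..."
    ax.annotate(short, (x, y), xytext=(4, 4), textcoords="offset points", fontsize=8)
```

With many points the annotations will overlap, so this is mostly useful for small samples like the one in the question.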

StupidWolf

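Picking up on your code, try the following:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

# Fit KMeans once on the TF-IDF matrix and reuse the fitted object
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_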

cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}

fig, ax = plt.subplots()
for g in np.unique(group):
    ix = np.where(group == g)
    ax.scatter(data2D[:,0][ix], data2D[:,1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()

I'm assuming your kmeans has n_clusters=3. The cdict and ldict dictionaries need one entry per cluster. In this case cluster 0 will be red with label label_1, cluster 1 will be blue with label label_2, and so on.
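
If the number of clusters changes, the two dictionaries can also be generated instead of typed by hand. The sketch below is one way to do it: make_cluster_dicts is a hypothetical helper (not part of any library), the "tab10" palette assumes at most 10 clusters, and the data is synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt


def make_cluster_dicts(group):
    """Build colour and legend-label dicts with one entry per cluster id."""
    ids = np.unique(group)
    palette = plt.get_cmap("tab10")  # 10 distinct colours; assumes <= 10 clusters
    cdict = {int(g): palette(i) for i, g in enumerate(ids)}
    ldict = {int(g): f"label_{i + 1}" for i, g in enumerate(ids)}
    return cdict, ldict


# Synthetic stand-ins for kmeans.labels_ and the PCA projection.
group = np.array([0, 0, 1, 2, 2, 1, 3])
data2D = np.random.default_rng(0).normal(size=(len(group), 2))

cdict, ldict = make_cluster_dicts(group)

fig, ax = plt.subplots()
for g in np.unique(group):
    ix = np.where(group == g)
    # color= is used because each entry is a single RGBA tuple, not a colour name
    ax.scatter(data2D[:, 0][ix], data2D[:, 1][ix], color=cdict[g], label=ldict[g], s=100)
ax.legend()
```

The loop is the same as above; only the source of cdict and ldict differs.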

EDIT: I changed the keys of cdict to start from 0. EDIT 2: Added labels.

Carlos Azevedo