Background:

I'm processing text (a dataset of 1000 documents) with Doc2Vec from the Gensim library. At the end I have a 300-dimensional vector for each document.

I then clustered the documents with K-means, using the vectors from this model.

input:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

kmeans_model = KMeans(n_clusters=5, init='k-means++', max_iter=100)

# Fit once: a second fit would re-initialize the centroids, so the labels
# and cluster_centers_ could come from two different runs.
labels = kmeans_model.fit_predict(model.docvecs.vectors_docs).tolist()

pca = PCA(n_components=2).fit(model.docvecs.vectors_docs)

datapoint = pca.transform(model.docvecs.vectors_docs)

plt.figure(figsize=(12,12))

label1 = ['red', 'pink', 'lightgreen', 'lightblue', 'cyan']
color = [label1[i] for i in labels]

plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)

centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)

plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker='^', s=150, c='black')

plt.show()

output:

[image: scatter plot produced by the code above]

My dataset has a label column, where each document has a label (class 0, 1, 2, 3 or 4). I would like to check whether this clustering matches those labels, i.e. whether a document labeled as class 1 is grouped by K-means with other documents of the same class. In the plot above, all documents are drawn as dots. Can I draw the plot with a different symbol for each class? Is that a good way to check this visually?
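One possible way to draw such a plot is sketched below. It is only an illustration, not a tested solution for this dataset: the helper names (group_by_class, plot_by_class) and the marker choices are made up for the example, and it assumes the class labels are the integers 0 to 4. Since plt.scatter accepts only one marker per call, the points are drawn one class at a time, with the color still showing the K-means cluster.

```python
from collections import defaultdict

# Marker per true class (0..4); the choice of markers is arbitrary.
CLASS_MARKERS = ['o', 's', '^', 'D', 'x']
# Color per K-means cluster (0..4); same colors as in the question.
CLUSTER_COLORS = ['red', 'pink', 'lightgreen', 'lightblue', 'cyan']


def group_by_class(true_classes):
    """Map each true class to the list of row indices that carry it."""
    groups = defaultdict(list)
    for idx, cls in enumerate(true_classes):
        groups[cls].append(idx)
    return dict(groups)


def plot_by_class(points, cluster_labels, true_classes):
    """Scatter 2-D points: marker = true class, color = assigned cluster."""
    # Imported here so that group_by_class has no plotting dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 12))
    for cls, idxs in sorted(group_by_class(true_classes).items()):
        xs = [points[i][0] for i in idxs]
        ys = [points[i][1] for i in idxs]
        colors = [CLUSTER_COLORS[cluster_labels[i]] for i in idxs]
        # One scatter call per class, because a call takes a single marker.
        ax.scatter(xs, ys, c=colors, marker=CLASS_MARKERS[cls],
                   label=f'class {cls}')
    ax.legend()
    return ax
```

With the variables from the question, this would be called as plot_by_class(datapoint, labels, df['Classes'].tolist()) followed by plt.show(). If the clustering agrees with the classes, each marker shape should appear mostly in a single color.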

Also, how could I check the accuracy of this? I want to see how many of those 1000 documents were clustered with others that have the same value in my dataset (column: df['Classes']).
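A rough sketch of one way to measure this follows. K-means cluster ids are arbitrary (cluster 0 need not correspond to class 0), so a plain element-wise comparison with df['Classes'] would be misleading. The sketch tries every one-to-one mapping from cluster id to class id and reports the best agreement. The function names are hypothetical, and it assumes both label sets use the integers 0 to n_groups - 1 and that the lists are non-empty.

```python
from collections import Counter
from itertools import permutations


def contingency(true_classes, cluster_labels):
    """Count how many documents fall into each (class, cluster) pair."""
    return Counter(zip(true_classes, cluster_labels))


def best_match_accuracy(true_classes, cluster_labels, n_groups=5):
    """Accuracy under the cluster-to-class mapping that agrees most often.

    Every one-to-one mapping is tried (5! = 120 mappings for five groups).
    Returns (accuracy, mapping), where mapping[k] is the class assigned
    to cluster k.
    """
    counts = contingency(true_classes, cluster_labels)
    best_hits, best_mapping = -1, None
    for perm in permutations(range(n_groups)):
        # Documents in cluster k whose true class is perm[k] count as hits.
        hits = sum(counts[(perm[k], k)] for k in range(n_groups))
        if hits > best_hits:
            best_hits, best_mapping = hits, perm
    return best_hits / len(true_classes), best_mapping


# Tiny illustration with three groups and six documents.
acc, mapping = best_match_accuracy([0, 0, 1, 1, 2, 2],
                                   [1, 1, 0, 0, 2, 1], n_groups=3)
print(acc, mapping)
```

For the data in the question the call would be best_match_accuracy(df['Classes'].tolist(), labels). For many clusters the brute-force search becomes slow; scipy.optimize.linear_sum_assignment solves the same matching problem efficiently, and sklearn.metrics.adjusted_rand_score gives a permutation-independent agreement score without any explicit mapping.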

U23r