Background:
I'm processing text: a dataset of 1000 documents, to which I apply Doc2Vec from the Gensim library. At the end I have a 300-dimensional vector for each document.
I then clustered the documents with K-means on these vectors.
input:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
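kmeans_model = KMeans(n_clusters=5, init='k-means++', max_iter=100)
# Fit only once: a second fit/fit_predict can converge differently, so the
# labels and the centroids plotted below would come from different runs
kmeans_model.fit(model.docvecs.vectors_docs)
# labels_ holds the cluster index assigned to each document
labels = kmeans_model.labels_.tolist()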
pca = PCA(n_components=2).fit(model.docvecs.vectors_docs)
datapoint = pca.transform(model.docvecs.vectors_docs)
plt.figure(figsize=(12,12))
label1 = ['red', 'pink', 'lightgreen', 'lightblue', 'cyan']
color = [label1[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker='^', s=150, c='black')
plt.show()
output:
My dataset has a label column, where each document has a label (class 0, 1, 2, 3 or 4). I would like to check whether this clustering agrees with those labels. For example, I want to check whether a document labeled as class 1 is grouped by k-means with the other documents from the same class. In this plot all documents are drawn as dots, so I was thinking of using a different symbol for each class. Can I do the plot with a different symbol for each class? And is that a good way to check this visually?
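To make the marker idea concrete, here is a rough sketch of what I have in mind. It uses random stand-in data instead of my real vectors, and the names `cluster_ids`, `true_classes` and `class_markers` are made up for the example. As far as I can tell, `scatter` takes a single marker per call, so the sketch draws one class at a time: color shows the k-means cluster, marker shows the class from the dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt

# Stand-in data (assumption): 2-D points as if already projected by PCA,
# k-means cluster ids, and the true classes that would come from df['Classes'].
rng = np.random.default_rng(0)
datapoint = rng.normal(size=(1000, 2))
cluster_ids = rng.integers(0, 5, size=1000)
true_classes = rng.integers(0, 5, size=1000)

cluster_colors = np.array(['red', 'pink', 'lightgreen', 'lightblue', 'cyan'])
class_markers = ['o', 's', '^', 'D', 'x']  # one marker per class 0..4

fig, ax = plt.subplots(figsize=(12, 12))
for cls, marker in enumerate(class_markers):
    mask = true_classes == cls  # documents belonging to this class
    ax.scatter(datapoint[mask, 0], datapoint[mask, 1],
               c=cluster_colors[cluster_ids[mask]].tolist(),  # color = cluster
               marker=marker,                                 # marker = class
               label=f'class {cls}')
ax.legend(title='True class (marker)')
```

With the real data, `datapoint` and `labels` from the code above would replace the random arrays, followed by `plt.show()` or `fig.savefig(...)`. If the clustering matched the classes well, each marker shape would appear in mostly one color.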
Also, how could I measure the accuracy of this, i.e. how many of those 1000 documents were clustered together with the others that have the same value in my dataset (column: df['Classes'])?
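One complication I'm aware of: the cluster ids that k-means assigns are arbitrary, so cluster 2 does not have to correspond to class 2, and comparing the two columns directly would not work. Below is a minimal sketch of one possible approach, assuming the number of clusters equals the number of classes: try every one-to-one mapping from cluster ids to classes and keep the one with the most matches. The function name `clustering_accuracy` and the toy lists are made up for illustration.

```python
from collections import Counter
from itertools import permutations

def clustering_accuracy(true_classes, cluster_ids):
    """Fraction of items matched under the best cluster -> class mapping.

    Assumes there are no more clusters than classes (hypothetical helper).
    """
    classes = sorted(set(true_classes))
    clusters = sorted(set(cluster_ids))
    # counts[(cluster, cls)] = number of items of class `cls` in `cluster`
    counts = Counter(zip(cluster_ids, true_classes))
    best_hits, best_map = -1, {}
    # Brute force over all one-to-one mappings; 5 clusters give only 120 candidates
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(n for (cluster, cls), n in counts.items()
                   if mapping[cluster] == cls)
        if hits > best_hits:
            best_hits, best_map = hits, mapping
    return best_hits / len(true_classes), best_map

# Toy example: 6 documents, 3 classes, cluster ids permuted relative to classes
true_classes = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 0]
accuracy, mapping = clustering_accuracy(true_classes, cluster_ids)
print(accuracy, mapping)
```

For my data the inputs would be `df['Classes'].tolist()` and `labels`. I also understand that `sklearn.metrics.adjusted_rand_score` compares two labelings without needing such a mapping, and that `scipy.optimize.linear_sum_assignment` can find the best mapping without brute force, but I'm not sure which of these is the right measure here.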