I am working on text clustering. I would need to plot the data using different colours.
I used kmeans
method for clustering and tf-idf
for similarity.
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])
Currently, my output looks like: there are a few elements as it is a test.
I would need to add labels (they are strings) and differentiate dots by clusters: each cluster should have its own colour to make the reader easy to analyse the chart.
Could you please tell me how to change my code in order to include both labels and colours? I think any example it would be great.
A sample of my dataset is (the output above was generated from a different sample):
Sentences
Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.