Clustering text data with Python's Scikit-Learn lib and plotting

Question

I'm new to clustering and I'm learning about text clustering. I found a way to make clusters, and now I'm trying to find a way to plot them. This is the error I get when I try to plot the clusters:

ValueError: setting an array element with a sequence.

This is my code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing',
     'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(x)    

my_list = []

for i in range(1,8):

    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)

plt.plot(range(1,8),my_list)
plt.show()


kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()

What am I doing wrong? I want to see which sentences are grouped in each cluster. Is it even possible to plot them like this? And how can I test the significance of the clusters found?

Regarding your second point: https://stackoverflow.com/questions/43784903/scikit-k-means-clustering-performance-measure – PV8, Aug 20 '19 at 12:39
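
One common internal measure of cluster quality is the silhouette coefficient, which ranges from -1 to 1, with higher values suggesting better-separated clusters. The sketch below is only an illustration and not necessarily what the linked question recommends: the shortened sentence list, the range of `k` values and `n_init=10` are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# A few sentences taken from the question (shortened list, for illustration only).
sentences = [
    'this is very good show', 'such a boring movie',
    'this is the best movie ever', 'this food is delicious',
    'i love this food, its so good', 'this is my favourite restaurant',
    'hawaii is the best place for trip', 'i had a great time on my school trip',
]

# The data set is tiny, so a dense array is fine here.
counts = CountVectorizer(stop_words='english').fit_transform(sentences).toarray()

# Compute the mean silhouette coefficient for several cluster counts.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(counts)
    scores[k] = silhouette_score(counts, labels)

for k, score in scores.items():
    print(f'k={k}: silhouette={score:.3f}')
```

With so few short sentences the scores will be noisy, so treat them as a rough comparison between values of `k` rather than a significance test.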

score 0 · Answer 1 · answered Aug 20 '19 at 14:16

Initially your observations are sentences. After applying the CountVectorizer to them, your observations are 62-dimensional vectors. You are getting a ValueError from pyplot (it is not clear to me what you are trying to plot, as your vectors have such a high dimension).
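
If the goal is a 2-D scatter plot, one option is to project the count vectors down to two components first. The error is probably raised because `x` is a SciPy sparse matrix, which `plt.scatter` cannot convert to plain coordinates. The sketch below is a minimal illustration under assumptions: `TruncatedSVD` is just one possible projection, and the shortened sentence list, `n_clusters=3` and the output file name `clusters_2d.png` are made up for the example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# A few sentences taken from the question (shortened list, for illustration only).
sentences = [
    'this is very good show', 'such a boring movie',
    'this is the best movie ever', 'this food is delicious',
    'i love this food, its so good', 'this is my favourite restaurant',
    'hawaii is the best place for trip', 'i had a great time on my school trip',
]

# Sparse document-term matrix: one row per sentence, one column per word.
counts = CountVectorizer(stop_words='english').fit_transform(sentences)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(counts)

# Assumption: TruncatedSVD accepts sparse input and returns a dense (n, 2) array.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, s=30)
# Label every point with its sentence so the groups can be read off the plot.
for (px, py), text in zip(coords, sentences):
    ax.annotate(text, (px, py), fontsize=7)
fig.savefig('clusters_2d.png')  # or plt.show() in an interactive session
```

Two components keep only part of the information in the original vectors, so points that look close in the plot are not guaranteed to share a cluster.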

From what I see, your model is going to be overly sensitive to pronouns ('this', 'that', etc.). Many models remove these and other stop words.
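
As a small illustration of the effect, the sketch below fits `CountVectorizer` with and without `stop_words='english'` and compares the vocabularies. The three sentences come from the question. The variable names are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['this is tasty', 'what is this', 'how that happened']

# Same sentences, vectorized with and without scikit-learn's built-in English stop list.
with_stop_words = CountVectorizer().fit(sentences)
without_stop_words = CountVectorizer(stop_words='english').fit(sentences)

print('kept everything:   ', sorted(with_stop_words.vocabulary_))
print('stop words removed:', sorted(without_stop_words.vocabulary_))
```

Words such as 'this' drop out of the second vocabulary, so they no longer influence the distances that KMeans works with.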

DBaker

Thanks for your answer about stop words. I'm wondering whether it is even possible to plot something like this, to represent clusters of sentences / words on a graph – taga Aug 20 '19 at 14:28
Your vector y_kmeans has the cluster number for each of your sentences. You can use it to see which sentences are grouped in each cluster – DBaker Aug 20 '19 at 14:34
And how do I see that? – taga Aug 20 '19 at 15:28
So if I add `stop_words = 'english'`, it will automatically remove words that do not have 'value'/'meaning'? I want to plot the groups of sentences from my clusters – taga Aug 21 '19 at 08:38
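
A minimal sketch of how `y_kmeans` could be used to list the sentences in each cluster. It assumes the original sentences are kept in their own variable (the question's code overwrites `x` with the count matrix). The shortened sentence list, `n_clusters=3` and the `clusters` dictionary are choices made for the example.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Keep the raw sentences in a separate variable from the count matrix.
sentences = [
    'this is very good show', 'such a boring movie',
    'this is the best movie ever', 'this food is delicious',
    'i love this food, its so good', 'this is my favourite restaurant',
    'hawaii is the best place for trip', 'i had a great time on my school trip',
]

counts = CountVectorizer(stop_words='english').fit_transform(sentences)
y_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(counts)

# y_kmeans[i] is the cluster number assigned to sentences[i].
clusters = defaultdict(list)
for sentence, label in zip(sentences, y_kmeans):
    clusters[int(label)].append(sentence)

for label in sorted(clusters):
    print(f'Cluster {label}:')
    for sentence in clusters[label]:
        print('   ', sentence)
```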
