
I want to cluster a list of 1 million strings (names), converted into a matrix with a term frequency-inverse document frequency (tf-idf) vectorizer, into len(names)/4 clusters, on a Spark EC2 instance with 8 GB of RAM and 64 GB of storage.
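
For context, tfidf_matrix is built with scikit-learn's TfidfVectorizer roughly as follows (the character n-gram settings here are illustrative, not my exact parameters):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative settings: character n-grams tend to work better than
    # word tokens for short strings such as names.
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
    tfidf_matrix = vectorizer.fit_transform(names)  # sparse CSR matrix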

    import pandas as pd
    from sklearn.cluster import KMeans

    # One cluster per 4 names; names is a list, so use its length.
    str_no_cluster = len(names) // 4
    km = KMeans(n_clusters=str_no_cluster, n_init=5, max_iter=30, n_jobs=-1)
    km.fit(tfidf_matrix)
    clusters = km.labels_.tolist()
    print(len(clusters))

    frame = pd.DataFrame(names_dict, index=[clusters], columns=names_dict.keys())
    print(frame)

When I run the Python KMeans program, it hangs for 10 to 15 minutes and then stops executing. Any ideas how to make it faster, and how much RAM and disk space does this require?
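
One idea I am looking at is scikit-learn's MiniBatchKMeans, which fits on small random batches instead of the full matrix and should need much less time and memory; a rough sketch of the swap (untested at this scale):

    from sklearn.cluster import MiniBatchKMeans

    # Same cluster count as above, but each iteration only processes
    # a small random batch instead of all 1 million rows.
    mbk = MiniBatchKMeans(n_clusters=str_no_cluster, n_init=5,
                          max_iter=30, batch_size=1000)
    mbk.fit(tfidf_matrix)
    clusters = mbk.labels_.tolist()

Even so, with 1 million names, names/4 is 250,000 clusters, so the dense centroid array alone may not fit in 8 GB of RAM.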

Sid Mhatre
  • Please provide the Spark logs and, if possible, the web UI metrics – T. Gawęda Sep 15 '16 at 12:49
  • By the way, the code snippet you've provided only computes the centroids. You are not evaluating your results, so nothing will be shown in the console – T. Gawęda Sep 15 '16 at 12:52
  • Not able to get Spark logs, but the result processing is done by clusters = km.labels_.tolist(); print(len(clusters)); frame = pd.DataFrame(names_dict, index=[clusters], columns=names_dict.keys()), and it works for <30,000 names – Sid Mhatre Sep 15 '16 at 12:58
