I want to cluster a list of 1 million strings (names). The names are converted into a term frequency-inverse document frequency (TF-IDF) matrix with a TF-IDF vectoriser, and I set the number of clusters to names/4 (i.e. 250,000 clusters). I am running this on a Spark EC2 instance with 8 GB RAM and 64 GB of storage.
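For context, this is roughly how I build `tfidf_matrix` (a minimal sketch with scikit-learn's `TfidfVectorizer` and default parameters; my real vectoriser parameters and name list differ):

```python
# Sketch of how tfidf_matrix is produced. Assumed: scikit-learn's
# TfidfVectorizer with default parameters; the real run uses 1 million names.
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["john smith", "jon smith", "jane doe"]  # toy sample list

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(names)   # sparse (n_names, n_terms) matrix
print(tfidf_matrix.shape)
```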
from sklearn.cluster import KMeans
import pandas as pd

# one cluster per four names: 250,000 clusters for 1 million names
str_no_cluster = len(names) // 4

km = KMeans(n_clusters=str_no_cluster, n_init=5, max_iter=30, n_jobs=-1)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(len(clusters))

# names_dict maps column names to the name data
frame = pd.DataFrame(names_dict, index=[clusters], columns=names_dict.keys())
print(frame)
When I run this Python KMeans program, it hangs for 10 to 15 minutes and then stops executing. Any ideas how to make it faster, and how much RAM and storage does this require?