How to get the top N frequent words in each cluster? Sklearn

Question

I have a text corpus of 1000+ articles, one per line. I used hierarchical clustering with scikit-learn in Python to group related articles. This is the code I used for the clustering:

Note: X is a sparse 2D matrix (converted to a dense NumPy array with X.toarray() below), with rows corresponding to documents and columns corresponding to terms.

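# Agglomerative clustering
from sklearn.cluster import AgglomerativeClustering

# Note: scikit-learn 1.2 renamed `affinity` to `metric`; `affinity` was removed in 1.4.
model = AgglomerativeClustering(affinity="euclidean", linkage="complete", n_clusters=3)
model.fit(X.toarray())
clustering = model.labels_  # one cluster label per document
print(clustering)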

I set n_clusters=3 to cut the tree at three clusters and get a flat clustering, as with k-means.

My question is: how do I get the top N most frequent words in each cluster, so that I can suggest a topic for each cluster? Thanks.

Please don't create duplicate questions [Text clustering using Scipy Hierarchy Clustering in Python](http://stackoverflow.com/questions/43707062/text-clustering-using-scipy-hierarchy-clustering-in-python) — Has QUIT--Anony-Mousse, May 01 '17 at 18:56
There is a similar question, I hope it might help someone: https://stackoverflow.com/questions/72260769/doc2vec-infer-words-from-vectors — frogseer, May 23 '22 at 09:59

elz · Accepted Answer · 2017-05-01T17:46:22.743

One option is to convert X from a sparse matrix to a pandas DataFrame. The rows will still correspond to documents, and the columns to words. If you have a list of your vocabulary in the same order as the matrix columns (used as your_word_list below), you could try something like this:

import pandas as pd

X = pd.DataFrame(X.toarray(), columns=your_word_list)  # columns argument is optional
X['Cluster'] = clustering  # Add column corresponding to cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()

# To get a sorted list for a numbered cluster, in this case 1
# (append .head(N) to keep only the top N words)
print(word_frequencies_by_cluster.loc[1, :].sort_values(ascending=False))
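If densifying the whole matrix into a DataFrame uses too much memory, the same per-cluster totals can be computed cluster by cluster with NumPy. The following is only a minimal sketch: `top_words_per_cluster` is a hypothetical helper name, the toy matrix, labels and vocabulary are invented for illustration, and it assumes the vocabulary list is in the same order as the columns of X.

```python
import numpy as np

def top_words_per_cluster(X, labels, vocabulary, n=3):
    """Return {cluster: [(word, total), ...]} with the n highest-total words.

    X may be a dense ndarray or a SciPy sparse matrix (documents x terms).
    """
    labels = np.asarray(labels)
    vocabulary = np.asarray(vocabulary)
    result = {}
    for cluster in np.unique(labels):
        # Sum term weights over the documents assigned to this cluster.
        # np.asarray(...).ravel() flattens both ndarray and sparse-matrix results.
        totals = np.asarray(X[labels == cluster].sum(axis=0)).ravel()
        # Column indices of the n largest totals, highest first.
        top = np.argsort(totals)[::-1][:n]
        result[int(cluster)] = [(str(vocabulary[i]), float(totals[i])) for i in top]
    return result

# Toy data (hypothetical): 4 documents, 4 terms, 2 clusters.
X_demo = np.array([
    [3, 0, 1, 0],
    [2, 0, 1, 1],
    [0, 4, 0, 2],
    [0, 1, 0, 5],
])
labels_demo = [0, 0, 1, 1]
vocab_demo = ["cat", "stock", "dog", "market"]

top = top_words_per_cluster(X_demo, labels_demo, vocab_demo, n=2)
for cluster, words in top.items():
    print(cluster, words)
```

With the variables from the question, the call would look like top_words_per_cluster(X, clustering, your_word_list, n=10). Ties between equally frequent words are broken arbitrarily.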

As a side note, you may want to look into algorithms (e.g. LDA) and distance metrics (cosine) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.
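To illustrate why cosine distance is often preferred for text: it ignores document length and compares only word proportions. For rows scaled to unit length, the squared Euclidean distance equals twice the cosine distance, so L2-normalizing the rows is one way to approximate cosine behaviour with a Euclidean metric. A small sketch with invented data follows; `l2_normalize_rows` and `cosine_distance` are hypothetical helper names.

```python
import numpy as np

def l2_normalize_rows(X):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    return X / norms

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy term-count rows (hypothetical data).
docs = np.array([
    [10.0, 0.0, 2.0],  # long document
    [1.0, 0.0, 0.2],   # short document with the same word proportions
    [0.0, 3.0, 1.0],   # document about something else
])

unit = l2_normalize_rows(docs)

# After normalization the long and short documents coincide.
same_direction = np.allclose(unit[0], unit[1])

# Squared Euclidean distance on unit rows equals 2 * cosine distance.
sq_euclid = np.sum((unit[0] - unit[2]) ** 2)
cos_dist = cosine_distance(docs[0], docs[2])
print(same_direction, sq_euclid, 2 * cos_dist)
```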

X, is a sparse NumPy 2D array with rows corresponding to documents and columns corresponding to terms. — user6872853, May 01 '17 at 17:11
OK, I edited to address that. That would be a good detail to include in your original post. — elz, May 01 '17 at 17:47
