
I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

# Agglomerative clustering
# X is the document-term matrix (e.g., the output of a vectorizer)
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
plt.clf()
hac.dendrogram(tree)
plt.show()

and I got this dendrogram plot:

Then I cut the tree into at most 3 flat clusters with fcluster():

from scipy.cluster.hierarchy import fcluster

clustering = fcluster(tree, 3, 'maxclust')
print(clustering)

and I got this output: [2 2 2 ..., 2 2 2]

My question is: how can I find the top 10 most frequent words in each cluster, in order to suggest a topic for each cluster?


2 Answers


You can do the following:

  1. Align your results (your clustering variable) with your input (the 1000+ articles).
  2. Using the pandas library, group the articles with groupby, with the cluster label as the key.
  3. Per group (using the get_group function), fill up a defaultdict of integers, counting every word you encounter.
  4. Sort the dictionary of word counts in descending order and take your desired number of most frequent words (see the sketch below).
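
A minimal sketch of these steps, assuming articles is a list of the raw article strings in the same row order as X (and hence as clustering); it iterates over the groups directly rather than calling get_group for each label, which is equivalent:

from collections import defaultdict
import pandas as pd

# 1. align cluster labels with the articles
df = pd.DataFrame({'article': articles, 'cluster': clustering})

# 2. group the articles by cluster label
for label, group in df.groupby('cluster'):
    # 3. count every word occurring in the group's articles
    counts = defaultdict(int)
    for text in group['article']:
        for word in text.lower().split():
            counts[word] += 1
    # 4. sort by count, descending, and keep the top 10
    top10 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(label, top10)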

Good luck with what you're doing and please do accept my answer if it's what you're looking for.


I'd do it like this. Given a df with an article name and the article text, like

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Argument  6 non-null      object
 1   Article   6 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
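
For concreteness, such a df could be built like this (the article names and texts below are made-up placeholders):

import pandas as pd

df = pd.DataFrame({
    'Argument': ['art1', 'art2', 'art3', 'art4', 'art5', 'art6'],
    'Article': [
        'cats and dogs are popular pets',
        'dogs often chase cats',
        'pets need food and care',
        'stocks rise on strong earnings',
        'markets fall after earnings reports',
        'investors watch stocks and markets',
    ],
})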

create the document-term matrix and the linkage tree

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import CountVectorizer

# vectorize the articles, dropping English stop words
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(df['Article'])

# document-term matrix, indexed by article name
df_dtm = pd.DataFrame(
    cv_matrix.toarray(),
    index=df['Argument'].values,
    columns=cv.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
)

# complete-linkage hierarchical clustering on the document-term matrix
tree = linkage(df_dtm, method="complete", metric="euclidean")

then get the chosen clustering

clustering = fcluster(tree, 2, 'maxclust')

and add clustering to df_dtm

# label each article with its cluster
df_dtm['_cluster_'] = clustering
df_dtm.index.name = '_article_'

# sum word counts per cluster and melt to long format
df_word_count = df_dtm.groupby('_cluster_').sum().reset_index().melt(
    id_vars=['_cluster_'], var_name='_word_', value_name='_count_'
)

finally, take the most frequent words per cluster (here the top 3)

words_1 = df_word_count[df_word_count._cluster_==1].sort_values(
    by=['_count_'], ascending=False).head(3)
words_2 = df_word_count[df_word_count._cluster_==2].sort_values(
    by=['_count_'], ascending=False).head(3)
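
To answer the original question's "top 10 per cluster" directly, the same df_word_count can be ranked for all clusters at once:

top10 = (
    df_word_count
    .sort_values(['_cluster_', '_count_'], ascending=[True, False])
    .groupby('_cluster_')
    .head(10)
)
print(top10)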