I'm experimenting with gensim and have got it to print the top 5 topics and the most popular nouns associated with each topic (following the example in "Topic Distribution and clustering using LDA"). My corpus has 51 documents. I can't get my last two clusters to work: building them keeps raising a "list index out of range" error, and I don't know what to change to fix it. I also tried a version using if/elif conditions (commented out below), but it gave an incorrect first cluster. Where exactly am I going wrong?

from gensim import corpora, models, similarities
from itertools import chain

# list of tokenised nouns from the noun documents
nounTokens = []

for index, row in df_Data.iterrows():
    # use the row yielded by iterrows(); iloc[index] only matches it for a default 0..n-1 index
    nounTokens.append(row['Noun Tokens'])

# create a dictionary using noun Tokens
id2word = corpora.Dictionary(nounTokens)

# creates the bag of word corpus
mm = [id2word.doc2bow(noun) for noun in nounTokens]

# trains lda models
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=5, update_every=1, chunksize=10000, passes=1)

# prints the topics of the corpus
for topics in lda.print_topics():
    print(topics)
print()

lda_corpus = lda[mm]

# collect the topic probabilities assigned to every document
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in lda_corpus]))
# use the mean of all those probabilities as the cluster threshold
threshold = sum(scores)/len(scores)
print(threshold)
print()
# cluster1 = []
# cluster2 = []
# cluster3 = []

# for i,j in zip(lda_corpus, noun_Docs):
#     if len(i) > 0:
#         if i[0][1] > threshold:
#             cluster1.append(j)
#     elif len(i)>1:
#         if i[1][1] > threshold:
#             cluster2.append(j)
#     elif len(i) > 2:
#         if i[2][1] > threshold:
#             cluster3.append(j)

cluster1 = [j for i, j in zip(lda_corpus, noun_Docs) if i[0][1] > threshold]
cluster2 = [j for i, j in zip(lda_corpus, noun_Docs) if i[1][1] > threshold]
cluster3 = [j for i, j in zip(lda_corpus, noun_Docs) if i[2][1] > threshold]
# for i,j in zip(lda_corpus, noun_Docs):
#     print(i)

print(cluster1)
# print(cluster2)
# print(cluster3)
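
A likely cause of the "list index out of range" error, sketched under assumptions: `lda[mm]` returns, for each document, a list of `(topic_id, probability)` pairs that leaves out topics below the model's minimum probability. A document dominated by one topic can therefore have a list with a single entry, so `i[1]` or `i[2]` does not exist. Position in that list is also not the same thing as topic id. The commented-out if/elif version avoids the error only because the first branch matches every non-empty list, so the other two branches never run. The sketch below uses made-up data in place of `lda[mm]`, and the helper names `topic_prob` and `cluster_for_topic` are hypothetical. It looks each topic up by id and treats a missing topic as probability 0.0.

```python
# Made-up stand-in for lda[mm]: one list of (topic_id, probability) pairs
# per document. Topics that were filtered out are simply absent.
sample_lda_corpus = [
    [(0, 0.91), (3, 0.06)],
    [(1, 0.55), (2, 0.42)],
    [(2, 0.97)],                      # only one entry: [1] and [2] do not exist
    [(0, 0.34), (1, 0.33), (2, 0.31)],
]
sample_docs = ["doc_a", "doc_b", "doc_c", "doc_d"]  # placeholder documents


def topic_prob(doc_topics, topic_id):
    """Return the probability of topic_id, or 0.0 if the topic is absent."""
    return dict(doc_topics).get(topic_id, 0.0)


def cluster_for_topic(lda_corpus, docs, topic_id, threshold):
    """Collect the documents whose probability for topic_id exceeds threshold."""
    return [doc for doc_topics, doc in zip(lda_corpus, docs)
            if topic_prob(doc_topics, topic_id) > threshold]


# Same threshold idea as in the question: mean of all reported probabilities.
scores = [score for doc_topics in sample_lda_corpus for _, score in doc_topics]
threshold = sum(scores) / len(scores)

# Positional indexing fails on the short list, just like in the question.
try:
    sample_lda_corpus[2][1]
    positional_lookup_failed = False
except IndexError:
    positional_lookup_failed = True

clusters = {topic_id: cluster_for_topic(sample_lda_corpus, sample_docs,
                                        topic_id, threshold)
            for topic_id in range(3)}
print(positional_lookup_failed)
print(clusters)
```

With a real model, the same lookup would apply to `zip(lda_corpus, noun_Docs)`. Another option that may work is asking gensim for the full distribution with `lda.get_document_topics(bow, minimum_probability=0.0)`, though whether every topic is then returned can depend on the gensim version.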