
I am able to run the LDA code from gensim and get the top 10 topics with their respective keywords.

Now I would like to go a step further and check how well the LDA algorithm performs by seeing which documents it clusters into each topic. Is this possible in gensim LDA?

Basically, I would like to do something like this, but in Python and using gensim:

LDA with topicmodels, how can I see which topics different documents belong to?

jxn
  • `gensim` is a cool and simple library, and the dev, Radim, is also a nice guy to approach about his library. Do you need something that clusters the documents by topic? – alvas Jan 08 '14 at 07:55

3 Answers


Using the probabilities of the topics, you can try to set some threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this 'hacky' method.

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3, \
                               update_every=1, chunksize=10000, passes=1)

# Prints the topics.
for top in lda.print_topics():
  print top
print

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in doc]
                      for doc in lda_corpus]))
threshold = sum(scores)/len(scores)
print threshold
print

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

print cluster1
print cluster2
print cluster3

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)

The code above gathers the score of every topic for every document into one list, then normalizes the sum by the number of scores. Since each document's topic probabilities sum to (approximately) 1, the average works out to 1/#topics, which matches the 0.3333... printed above.
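As a follow-up, the same trained model can also score a document that was not in the training corpus. A minimal sketch, reusing `lda` and `id2word` from above (the `new_doc` string is made up for illustration):

# Query the trained model with an unseen document (hypothetical example text).
new_doc = "human computer interaction survey"
new_bow = id2word.doc2bow(new_doc.lower().split())
# Prints a list of (topic_id, probability) tuples for the new document.
print lda[new_bow]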

alvas
  • This looks like a good solution! Another solution I found was to use the topic distribution to do k-means clustering, as seen in this link http://stackoverflow.com/questions/6486738/clustering-using-latent-dirichlet-allocation-algo-in-gensim, but I am not sure how to implement it. Would you know how to do it? (A sketch of this idea appears after these comments.) – jxn Jan 08 '14 at 17:38
  • I'm trying to re-implement Brown clustering (http://stackoverflow.com/questions/20998832/what-does-the-brown-clustering-algorithm-output-mean) too, but given (topic, prob) tuples, you can try the script from http://stackoverflow.com/questions/20990538/how-can-i-cluster-a-list-of-a-list-of-tuple-tag-probability-python – alvas Jan 08 '14 at 17:45
  • How could you use more clusters, based on how many topics you have? – dh762 Jan 11 '14 at 20:03
  • That's the scary part: no one knows the best number of topics to set, and no one knows the best number of clusters to extract. I'm no computer scientist, but I'm sure there's someone who can somehow determine the optimal number of topics/clusters. – alvas Jan 11 '14 at 20:13
  • I've gotten much better performance by removing unique words, as in [this question](http://stackoverflow.com/questions/21100903/improve-performance-remove-all-strings-in-a-big-list-appearing-only-once) – dh762 Jan 13 '14 at 21:00
  • I'm confused; didn't the code I have up there already remove words that occur only once? – alvas Jan 13 '14 at 23:01
  • Ahhh, now I see: performance as in speed, not accuracy/precision =) – alvas Jan 13 '14 at 23:02
  • Can you explain this line of code more specifically? `scores = list(chain(*[[score for topic_id, score in doc] for doc in lda_corpus])); threshold = sum(scores)/len(scores)` – jxn Dec 04 '14 at 06:20
  • And, how did you get the numbers for `[j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]` at the [0][1] part? – jxn Dec 04 '14 at 06:56
  • The outer index accesses a document's topic list in `lda_corpus`, and the inner index accesses the topic score. You should print it out for yourself: try `print [i for i in lda_corpus]`, then `[i[1] for i in lda_corpus]`, then `lda_corpus[0][1]`. – alvas Dec 04 '14 at 07:14
  • would you know how a topic score is computed? – jxn Dec 04 '14 at 07:49
  • Go through the materials from http://www.cs.princeton.edu/~blei/topicmodeling.html – alvas Dec 04 '14 at 08:12
  • @alvas I'd recommend using mallet with prior optimisation turned on (the default, I think) and a large number of topics. This is effectively the same as using hierarchical (prior) topic models that essentially infer the number of topics (yes, they do exist), since many of the topics found by mallet end up with very few words assigned to them. BTW: you can run mallet from gensim. – drevicko Jul 07 '16 at 11:24
  • LDA gives overlapping clusters and not distinct clusters. https://stackoverflow.com/questions/49380258/inefficiency-of-topic-modelling-for-text-clustering – StatguyUser Mar 20 '18 at 09:17
  • Does 'cluster' here mean 'topic'? – Bhaskar Dhariyal Dec 12 '18 at 09:49
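Following up on the k-means idea raised in the first comment above: one option (not from the answers themselves) is to densify the per-document topic distributions and cluster them with scikit-learn. A minimal sketch, assuming scikit-learn is installed and reusing `lda`, `mm`, and `documents` from the answer:

from gensim.matutils import corpus2dense
from sklearn.cluster import KMeans

num_topics = 3

# corpus2dense turns the sparse (topic_id, prob) lists into a
# (num_topics, num_docs) array; transpose to get documents x topics.
doc_topic = corpus2dense(lda[mm], num_terms=num_topics).T

# Cluster documents by their topic distributions.
labels = KMeans(n_clusters=num_topics).fit_predict(doc_topic)
for label, doc in zip(labels, documents):
    print label, doc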

If you want to use the trick of

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

in the previous answer by alvas, make sure to set `minimum_probability=0` in `LdaModel`:

gensim.models.ldamodel.LdaModel(corpus,
            num_topics=num_topics, id2word = dictionary,
            passes=2, minimum_probability=0)

Otherwise the dimensions of lda_corpus and documents may not agree, since gensim suppresses any topic whose probability is lower than minimum_probability.
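If retraining is not an option, newer gensim versions also let you override the threshold at query time via get_document_topics. A sketch, assuming your gensim version supports the minimum_probability keyword there:

# Build the full per-document topic lists without retraining the model.
lda_corpus = [lda.get_document_topics(bow, minimum_probability=0)
              for bow in mm]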

An alternative way to group documents into topics is to assign each document to the topic with the maximum probability:

lda_corpus = [max(prob, key=lambda y: y[1])
              for prob in lda[mm]]
playlists = [[] for i in xrange(num_topics)]
for i, x in enumerate(lda_corpus):
    playlists[x[0]].append(documents[i])

Note that lda[mm] is, roughly speaking, a list of lists, i.e. a 2D matrix: the number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple of the form (3, 0.82), for example, where 3 is the topic index and 0.82 is the probability of the document belonging to that topic. By default, minimum_probability=0.01, and any tuple with probability less than 0.01 is omitted from lda[mm]. If you use the grouping-by-maximum-probability method, you can set minimum_probability to 1/#topics.
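To see this structure concretely, you can print the tuples per document (the probabilities in the comment are illustrative, not real output):

# Inspect the (topic_id, probability) tuples for each document.
for doc_id, doc_topics in enumerate(lda[mm]):
    print doc_id, doc_topics  # e.g. 0 [(0, 0.09), (1, 0.09), (2, 0.82)]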

nos
  • Yes, assigning by maximum probability is what I thought about afterwards as well! Thanks for showing the implementation. – jxn May 02 '16 at 00:27
  • Hey @nos, could you explain what the first part of the code does, in particular the `[0][1] > threshold` part? What do these numbers represent? – Economist_Ayahuasca May 26 '16 at 14:37
  • @AndresAzqueta The elements of lda_corpus are of the form [(0, p0), (1, p1), ...], where the first number is the topic index and the second number is the corresponding probability of the document belonging to that topic. If there are N topics, that list contains N tuples. However, if minimum_probability is not 0, tuples with probability lower than minimum_probability are not included in that list. – nos May 26 '16 at 23:32
  • Hey @nos, thanks very much for the answer. So if I have five topics, the series would be: [0][1] > threshold, [1][1] > threshold, [2][1] > threshold, [3][1] > threshold, [4][1] > threshold? Thanks – Economist_Ayahuasca May 27 '16 at 07:51

Each lda_corpus[i] is of the form [(0, p0), (1, p1), ..., (n, pn)], where i is the document index; in each tuple, the 1st term denotes the topic index and the 2nd term denotes the probability of that topic in that particular document.

ayushi