29

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation (LDA). I am doing the following:

Define the dataset:

documents = ["Apple is releasing a new product", 
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]             

After removing stopwords, I create the dictionary and the corpus:

texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

Then I define the LDA model.

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)

Then I print the topics:

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product', '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new', '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is', '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new', '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>> 

I'm not able to make much sense of this result. Is it giving the probability of occurrence of each word? Also, what is the meaning of topic #1, topic #2, etc.? I was expecting something more or less like the most important keywords.

I already checked the gensim tutorial but it didn't really help much.

Thanks.

sophros
visakh
  • Just so you know, those numbers are the relative importance of each word in the topic. The reason they don't add up to 1 is that by default `print_topics` shows 10 words. If you show 100 or so, the sum will get close to 1. – sachinruk Oct 11 '16 at 00:23
  • see http://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html – Chris Nov 22 '17 at 19:44

5 Answers

21

The answer you're looking for is in the gensim tutorial. lda.print_topics(k) prints the most contributing words for k randomly selected topics. What you see is (a truncated view of) the distribution of words over each of those topics, i.e. the probability of each listed word appearing under the topic to the left.

Usually, one would run LDA on a large corpus. Running LDA on a ridiculously small sample won't give the best results.
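If you just want the top few words per topic rather than the whole weighted printout, a minimal sketch (assuming the lda model from the question; in recent gensim versions show_topic returns (word, probability) pairs sorted by probability) would be:

for topic_id in range(lda.num_topics):
    # show_topic returns the topn most probable words for the given topic
    top_words = [word for word, prob in lda.show_topic(topic_id, topn=3)]
    print("topic %d: %s" % (topic_id, ", ".join(top_words)))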

James Mishra
Steve P.
  • Thanks for the reply... any idea how it is splitting up the corpus into five different topics? Also, is it possible to pick out the top words alone rather than getting the distribution of words for each topic? I agree this is a really small sample, but I wanted to understand it first before trying a bigger one. – visakh Dec 03 '13 at 11:50
  • 3 docs is indeed a small number, and "huge" is relative here: we manage to get significant results for corpora of a few hundred docs, and have a hard time when the corpus size exceeds tens of thousands. – alko Dec 03 '13 at 11:52
  • @user295338 You probably need to read some papers on LDA and its applications, [blei's original article](http://jmlr.org/papers/v3/blei03a.html) is a good start. – alko Dec 03 '13 at 11:54
  • @user295338 Yes, but it's complicated to explain, especially over this medium. I suggest reading some papers if you're mathematically mature, if not, it's going to take a lot of reading to understand. Luckily there are a lot of resources for LDA. If you have a strong statistics and probability background, you'll be fine, otherwise it's going to be a long road, but it's worth it! As for getting the words with the highest probabilities for a topic, you could just sort the output and go from there. – Steve P. Dec 03 '13 at 12:02
  • @user295338 The paper that alko linked is a great source, but may not be the best place for you to start. Again, it depends on your background. – Steve P. Dec 03 '13 at 12:05
  • @user295338 In case you lack a math (probabilities and stuff) background, I just found a good explanation of LDA at Quora, http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation – alko Dec 03 '13 at 13:11
  • Thanks for the input and the links... hopefully I'll be able to get this going. Thanks again! :-) – visakh Dec 03 '13 at 19:59
19

I think this tutorial will help you understand everything very clearly - https://www.youtube.com/watch?v=DDq3OVp9dNA

I too faced a lot of problems understanding it at first. I'll try to outline a few points in a nutshell.

In Latent Dirichlet Allocation,

  • The order of words is not important in a document - Bag of Words model.
  • A document is a distribution over topics
  • Each topic, in turn, is a distribution over words belonging to the vocabulary
  • LDA is a probabilistic generative model. It is used to infer hidden variables using a posterior distribution.

Imagine the process of creating a document to be something like this:

  1. Choose a distribution over topics for the document
  2. For each word in the document, draw a topic from that distribution and then draw a word from the chosen topic (a toy sketch of this generative process follows the list)
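
Here is a toy sketch of that generative story in plain numpy (the vocabulary, topics and probabilities below are made up purely for illustration; this is not gensim code):

import numpy as np

vocab = ["apple", "amazon", "microsoft", "nokia", "product", "sells"]
# made-up word distributions for two imaginary topics
topic_word = np.array([
    [0.4, 0.0, 0.0, 0.0, 0.4, 0.2],   # a "consumer products"-style topic
    [0.0, 0.3, 0.3, 0.3, 0.0, 0.1],   # a "companies"-style topic
])

doc_topics = np.random.dirichlet([0.5, 0.5])   # step 1: the document's distribution over topics
words = []
for _ in range(5):                             # generate a 5-word document
    topic = np.random.choice(len(topic_word), p=doc_topics)     # step 2a: draw a topic
    words.append(np.random.choice(vocab, p=topic_word[topic]))  # step 2b: draw a word from it
print(words)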

LDA is, in a sense, backtracking along this process: given that you have a bag of words representing a document, what could be the topics it is representing?

So, in your case, the first topic (0)

INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product

is more about things, amazon and many, since they have a higher proportion, and not so much about microsoft or apple, which have significantly lower values.
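
You can also look at it from the document side and ask the trained model for each document's topic distribution. A minimal sketch, assuming the lda, corpus, and documents objects from the question (the exact numbers will differ from run to run):

for doc, bow in zip(documents, corpus):
    # get_document_topics returns (topic id, probability) pairs; lda[bow] is equivalent
    print(doc, "->", lda.get_document_topics(bow))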

I would suggest reading this blog for a much better understanding (Edwin Chen is a genius!) - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Utsav T
11

Since the above answers were posted, there are now some very nice visualization tools for gaining an intuition of LDA using gensim.

Take a look at the pyLDAvis package. Here is a great notebook overview. And here is a very helpful video description geared toward the end user (9 min tutorial).
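For example, a minimal pyLDAvis sketch (assuming the lda, corpus, and dictionary objects from the question; note the module name varies by version, pyLDAvis.gensim in older releases and pyLDAvis.gensim_models in pyLDAvis 3.x and later):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis   # use pyLDAvis.gensim in older versions

vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")      # or pyLDAvis.display(vis) inside a notebook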

Hope this helps!

plfrick
2

To help with understanding the gensim LDA implementation, I recently penned blog posts implementing topic modeling from scratch on 70,000 Simple Wikipedia dump articles in Python.

There, you can find a detailed explanation of how gensim's LDA can be used for topic modeling, covering:

  • the ElementTree library for extracting article text from the XML dump file
  • regex filters to clean the articles
  • NLTK stop-word removal and lemmatization
  • LDA from the gensim library (a rough sketch of these last two steps follows below)
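
A minimal sketch of the NLTK preprocessing plus gensim LDA steps (assuming articles is a list of already-extracted, regex-cleaned strings; the variable names and parameter values here are illustrative, not the exact ones from the posts):

from nltk.corpus import stopwords          # requires the nltk "stopwords" data
from nltk.stem import WordNetLemmatizer    # requires the nltk "wordnet" data
from gensim import corpora, models

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# tokenize, drop non-alphabetic tokens and stop words, then lemmatize
texts = [[lemmatizer.lemmatize(w) for w in article.lower().split()
          if w.isalpha() and w not in stop_words]
         for article in articles]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)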

Hope it helps in understanding the LDA implementation of the gensim package.

Part 1

Topic Modelling (Part 1): Creating Article Corpus from Simple Wikipedia dump

Part 2

Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation

Word cloud (10 words) of a few topics that I got as an outcome.

Abhijeet Singh
0

It is returning the probability that each word is associated with that topic. By default, the LDA model shows you the top ten words per topic :)
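
For instance, a small sketch (assuming the lda model and dictionary from the question) of how to control how many words are printed per topic:

lda.print_topics(num_topics=5, num_words=3)                 # only the top 3 words per topic
lda.print_topics(num_topics=5, num_words=len(dictionary))   # every word; each topic's probabilities then sum to ~1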

Sara