Questions tagged [lda]

Latent Dirichlet Allocation, LDA, is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

If observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA represents documents as mixtures of topics that spit out words with certain probabilities.

It should not be confused with Linear Discriminant Analysis, a supervised learning procedure for classifying observations into a set of categories.

1175 questions
44
votes
6 answers

Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix: corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) corpus <-…
Bill M
  • 711
  • 1
  • 6
  • 8
44
votes
2 answers

LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can…
tan
  • 1,569
  • 5
  • 14
  • 30
34
votes
3 answers

Python Gensim: how to calculate document similarity using the LDA model?

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a…
still_st
  • 363
  • 1
  • 3
  • 7
32
votes
2 answers

Simple Python implementation of collaborative topic modeling?

I came across these 2 papers which combined collaborative filtering (Matrix factorization) and Topic modelling (LDA) to recommend users similar articles/posts based on topic terms of post/articles that users are interested in. The papers (in PDF)…
jxn
  • 7,685
  • 28
  • 90
  • 172
31
votes
3 answers

Topic distribution: How do we see which document belong to which topic after doing LDA in python

I am able to run the LDA code from gensim and got the top 10 topics with their respective keywords. Now I would like to go a step further to see how accurate the LDA algo is by seeing which document they cluster into each topic. Is this possible in…
jxn
  • 7,685
  • 28
  • 90
  • 172
29
votes
5 answers

Understanding LDA implementation using gensim

I am trying to understand how gensim package in Python implements Latent Dirichlet Allocation. I am doing the following: Define the dataset documents = ["Apple is releasing a new product", "Amazon sells many things", …
visakh
  • 2,503
  • 8
  • 29
  • 55
29
votes
5 answers

how to determine the number of topics for LDA?

I am a freshman in LDA and I want to use it in my work. However, some problems appear. In order to get the best performance, I want to estimate the best topic number. After reading "Finding Scientific topics", I know that I can calculate logP(w|z)…
Chelsea Wang
  • 599
  • 2
  • 5
  • 19
26
votes
2 answers

What's the disadvantage of LDA for short texts?

I am trying to understand why Latent Dirichlet Allocation(LDA) performs poorly in short text environments like Twitter. I've read the paper 'A biterm topic model for short text', however, I still do not understand "the sparsity of word…
Shuguang Zhu
  • 263
  • 1
  • 3
  • 5
26
votes
10 answers

How to print the LDA topics models from gensim? Python

Using gensim I was able to extract topics from a set of documents in LSA but how do I access the topics generated from the LDA models? When printing the lda.print_topics(10) the code gave the following error because print_topics() return a…
alvas
  • 115,346
  • 109
  • 446
  • 738
24
votes
1 answer

Export pyLDAvis graphs as standalone webpage

i am analysing text with topic modelling and using Gensim and pyLDAvis for that. Would like to share the results with distant colleagues, without a need for them to install python and all required libraries. Is there a way to export interactive…
Darius
  • 596
  • 1
  • 6
  • 22
21
votes
1 answer

Predicting LDA topics for new data

It looks like this question has may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking…
David
  • 9,284
  • 3
  • 41
  • 40
20
votes
3 answers

How does the removeSparseTerms in R work?

I am using the removeSparseTerms method in R and it required a threshold value to be input. I also read that the higher the value, the more will be the number of terms retained in the returned matrix. How does this method work and what is the logic…
London guy
  • 27,522
  • 44
  • 121
  • 179
19
votes
4 answers

LDA model generates different topics everytime i train on the same corpus

I am using python gensim to train an Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time i repeat the process, it generates different topics. Why does the same LDA parameters and corpus generate…
alvas
  • 115,346
  • 109
  • 446
  • 738
19
votes
3 answers

LDA with topicmodels, how can I see which topics different documents belong to?

I am using LDA from the topicmodels package, and I have run it on about 30.000 documents, acquired 30 topics, and got the top 10 words for the topics, they look very good. But I would like to see which documents belong to which topic with the…
d12n
  • 841
  • 2
  • 10
  • 20
17
votes
2 answers

Document topical distribution in Gensim LDA

I've derived a LDA topic model using a toy corpus as follows: documents = ['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface…
Moses Xu
  • 2,140
  • 4
  • 24
  • 35
1
2 3
78 79