Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats (source: wikipedia)

Generative models (i.e. the statistical models used for topic modelling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)

Software / Libraries

Related Tags :

980 questions
44
votes
6 answers

Remove empty documents from DocumentTermMatrix in R topicmodels?

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix: corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) corpus <-…
Bill M
  • 711
  • 1
  • 6
  • 8
44
votes
2 answers

LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can…
tan
  • 1,569
  • 5
  • 14
  • 30
32
votes
2 answers

Simple Python implementation of collaborative topic modeling?

I came across these 2 papers which combined collaborative filtering (Matrix factorization) and Topic modelling (LDA) to recommend users similar articles/posts based on topic terms of post/articles that users are interested in. The papers (in PDF)…
jxn
  • 7,685
  • 28
  • 90
  • 172
29
votes
2 answers

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60. I have divided my corpus into ten batches and set aside one batch for a holdout…
user37874
  • 415
  • 1
  • 5
  • 11
29
votes
5 answers

Understanding LDA implementation using gensim

I am trying to understand how gensim package in Python implements Latent Dirichlet Allocation. I am doing the following: Define the dataset documents = ["Apple is releasing a new product", "Amazon sells many things", …
visakh
  • 2,503
  • 8
  • 29
  • 55
26
votes
2 answers

What's the disadvantage of LDA for short texts?

I am trying to understand why Latent Dirichlet Allocation(LDA) performs poorly in short text environments like Twitter. I've read the paper 'A biterm topic model for short text', however, I still do not understand "the sparsity of word…
Shuguang Zhu
  • 263
  • 1
  • 3
  • 5
26
votes
10 answers

How to print the LDA topics models from gensim? Python

Using gensim I was able to extract topics from a set of documents in LSA but how do I access the topics generated from the LDA models? When printing the lda.print_topics(10) the code gave the following error because print_topics() return a…
alvas
  • 115,346
  • 109
  • 446
  • 738
25
votes
2 answers

Gensim: KeyError: "word not in vocabulary"

I have a trained Word2vec model using Python's Gensim Library. I have a tokenized list as below. The vocab size is 34 but I am just giving few out of 34: b = ['let', 'know', 'buy', 'someth', 'featur', 'mashabl', 'might', 'earn', 'affili', …
Krishnang K Dalal
  • 2,322
  • 9
  • 34
  • 55
24
votes
1 answer

Export pyLDAvis graphs as standalone webpage

i am analysing text with topic modelling and using Gensim and pyLDAvis for that. Would like to share the results with distant colleagues, without a need for them to install python and all required libraries. Is there a way to export interactive…
Darius
  • 596
  • 1
  • 6
  • 22
21
votes
3 answers

Using Word2Vec for topic modeling

I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA). However, I am interested whether it is a good idea to try out topic modeling with Word2Vec as it clusters…
user1814735
  • 261
  • 1
  • 2
  • 5
21
votes
1 answer

Predicting LDA topics for new data

It looks like this question has may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking…
David
  • 9,284
  • 3
  • 41
  • 40
20
votes
6 answers

Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third,…
emiguevara
  • 1,359
  • 13
  • 26
19
votes
4 answers

LDA model generates different topics everytime i train on the same corpus

I am using python gensim to train an Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time i repeat the process, it generates different topics. Why does the same LDA parameters and corpus generate…
alvas
  • 115,346
  • 109
  • 446
  • 738
19
votes
3 answers

LDA with topicmodels, how can I see which topics different documents belong to?

I am using LDA from the topicmodels package, and I have run it on about 30.000 documents, acquired 30 topics, and got the top 10 words for the topics, they look very good. But I would like to see which documents belong to which topic with the…
d12n
  • 841
  • 2
  • 10
  • 20
17
votes
2 answers

get_document_topics and get_term_topics in gensim

The ldamodel in gensim has the two methods: get_document_topics and get_term_topics. Despite their use in this gensim tutorial notebook, I do not fully understand how to interpret the output of get_term_topics and created the self-contained code…
tkja
  • 1,950
  • 5
  • 22
  • 40
1
2 3
65 66