
I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA).

However, I am interested in whether it is a good idea to try out topic modeling with Word2Vec, since it clusters words in vector space. Couldn't the clusters therefore be regarded as topics?

Do you think it makes sense to follow this approach for research purposes? In the end, what I am interested in is extracting keywords from text according to topics.
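To make the idea concrete, here is a minimal sketch, assuming the gensim 4 API and scikit-learn: cluster pre-trained word vectors with k-means and read each cluster as a rough "topic". The vector file, the vocabulary cutoff, and the cluster count are all placeholders.

```python
# Hedged sketch: k-means clusters over word2vec vectors as rough "topics".
# "vectors.bin", the 5000-word cutoff, and 20 clusters are placeholders.
from collections import defaultdict

from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
words = wv.index_to_key[:5000]          # most frequent words only
X = wv[words]                           # matrix of shape (n_words, dim)

km = KMeans(n_clusters=20, random_state=0).fit(X)

# Group words by their cluster label and inspect one "topic"
topics = defaultdict(list)
for word, label in zip(words, km.labels_):
    topics[label].append(word)
print(topics[0][:10])
```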

user1814735
    I tried something along these lines recently. You can get coherent topics by clustering Word2Vec (or GloVe) vectors: goo.gl/irZ5xI – duhaime Oct 07 '15 at 01:56
  • You can certainly do this, but I wouldn't call it topic modelling. – Sir Cornflakes Oct 07 '15 at 07:58
  • @duhaime thanks for your reply! What you are working on is exactly what I am looking for! Do you know by any chance how well the clusters compare to topics extracted by e.g. LDA? Since I am new to this topic, I would be very glad if you could give me keywords with which I can find related research papers. – user1814735 Oct 07 '15 at 12:50
  • @jknappen what would you call this technique instead? Clustering? – user1814735 Oct 07 '15 at 12:52
  • Yes, clustering (and the results of the clustering are clusters). – Sir Cornflakes Oct 07 '15 at 13:26
  • Topic models (at least in LDA and NMF) are essentially distributions over a fixed vocabulary. Each word in the vocabulary has a probability between 0 and 1 of belonging to each topic. The hard clustering technique I discussed above places words into discrete groups, so each word has membership in exactly one cluster. You could measure the distance from a word to each cluster to get a continuous representation. I hope this helps! – duhaime Oct 07 '15 at 15:57
  • Yes that helps a lot! Thank you! – user1814735 Oct 08 '15 at 20:11
  • @user1814735 can you explain a bit more about your approach? I was thinking along similar lines and wanted to know how a document can be represented in vector format. I know that word2vec gives a vector for each word like `dog`, but how do I get a vector for a document like `its a cute dog` using pre-trained word2vec models? – Regressor Nov 02 '20 at 17:49
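One common baseline for the document-vector question in the last comment (a heuristic, not something word2vec itself provides) is to average the vectors of the document's words. A minimal sketch, assuming the gensim 4 API and the standard pre-trained GoogleNews model purely as an example:

```python
# Hedged sketch: a document vector as the mean of its word vectors.
# The GoogleNews file is just one example of a pre-trained model.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def doc_vector(text: str) -> np.ndarray:
    tokens = [t for t in text.lower().split() if t in wv]  # drop OOV tokens
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean(wv[tokens], axis=0)

vec = doc_vector("its a cute dog")  # one 300-dimensional vector
```

Note that averaging loses word order ("dog bites man" and "man bites dog" get the same vector); alternatives such as gensim's Doc2Vec learn document vectors directly.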

3 Answers


You might want to look at the following papers:

Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313. [CODE]

Yang Liu, Zhiyuan Liu, Tat-Seng Chua and Maosong Sun. 2015. Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2418-2424. [CODE]

The first paper integrates word embeddings into the LDA model and into the one-topic-per-document DMM model. It reports significant improvements on topic coherence, document clustering, and document classification tasks, especially on small corpora or short texts (e.g., tweets).

The second paper is also interesting. It uses LDA to assign a topic to each word, and then employs Word2Vec to learn word embeddings based on both words and their topics.

NQD

Two groups have tried to solve this.

Chris Moody at StitchFix came out with LDA2Vec, and some Ph.D. students at CMU wrote a paper called "Gaussian LDA for Topic Models with Word Embeddings" with code here... though I could not get the Java code there to output sensible results. It's an interesting idea of combining word2vec with Gaussian (actually t-distributions, when you work out the math) word-topic distributions. Gaussian LDA should be able to handle out-of-vocabulary words not seen during training.

LDA2Vec attempts to train both the LDA model and the word vectors at the same time, and it also allows you to put LDA priors over non-words to get really interesting results.

Mansweet

In Word2Vec, consider three sentences:
“the dog saw a cat”,
“the dog chased the cat”,
“the cat climbed a tree”

Here, if we give the input word 'cat', we get an output word such as 'climbed'.

This is based on the probability of all words given the context word ('cat'); it is a continuous bag-of-words (CBOW) model. We get words similar to the input word based on the context. Word2Vec works well only on huge datasets.
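A tiny, hedged illustration of this example with gensim's `Word2Vec` (CBOW is the default, `sg=0`); with only three sentences the neighbours are essentially noise, which is exactly the point about needing a large dataset.

```python
# Toy CBOW demo on the three sentences above; assumes the gensim 4 API.
# Results on this little data are not meaningful; real use needs a large corpus.
from gensim.models import Word2Vec

sentences = [
    "the dog saw a cat".split(),
    "the dog chased the cat".split(),
    "the cat climbed a tree".split(),
]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1,
                 sg=0, seed=1, workers=1)  # workers=1 keeps the seed deterministic
print(model.wv.most_similar("cat", topn=3))
```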

LDA is used to abstract topics from a corpus. It is not based on context: it uses Dirichlet distributions to draw a distribution over words for each topic and a distribution over topics for each document. The problem we face here is randomness: we get different outputs each time.

The technique we choose depends upon our requirements.

Thomas N T
  • You can control the randomness in LDA by setting a random seed (e.g. with mallet). This gives you replicable results. It does not change the fact that different random seeds give different topic models. – Sir Cornflakes Oct 23 '15 at 19:31
  • OK, I have implemented it in Python (gensim). I ran 20 iterations and took the intersection of all output topics. Theoretically, according to the Dirichlet distribution, the output is random each time. I didn't use MALLET in Java. Thanks @jknappen for the information. – Thomas N T Oct 24 '15 at 16:02
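A minimal sketch of the seeding point above, assuming gensim's `LdaModel` (the toy corpus is a placeholder): fixing `random_state` makes a run replicable, though, as noted, different seeds still yield different topic models.

```python
# Hedged sketch: reproducible LDA in gensim by fixing random_state.
# The three toy "documents" below are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["dog", "cat", "chase"], ["cat", "tree", "climb"], ["dog", "bark"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Same random_state => same topics across runs; a different seed
# would give a different (but equally valid) model.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)
print(lda.print_topics())
```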