0

I want to apply the LDA algorithm to a corpus to find similar words when given a word or phrase as input. How can this be done?

Also, does LDA ignore the order of words in a document? Does it also ignore the order of documents in the corpus?

Can some other strategy also be used for finding similar words? The order of words in a document does not matter because of the language of the documents I am using; that is, each document is a bag of words and word order doesn't matter.

ayush gupta
    Possible duplicate of [how could I make a search match for similar words](https://stackoverflow.com/questions/4064042/how-could-i-make-a-search-match-for-similar-words) – Shaido Jun 27 '17 at 02:45
  • This is not a dupe @Shaido – eliasah Jun 27 '17 at 07:22
  • Unfortunately your question isn't very specific, and it sounds as if you are asking for a tutorial, which is off topic on SO. You ought to try something, fail, and then post a more specific question so we can help you! I'm voting to close it for that reason at the moment. – eliasah Jun 27 '17 at 07:24

2 Answers

1
  1. Does LDA ignore the order of words in a document? YES
  2. Does it also ignore the order of Documents in the corpus? YES

An LDA model outputs two distributions (as two matrices): a document-topic distribution and a topic-word distribution. In short, you can transpose the topic-word matrix so that each word gets a vector of topic weights, and then compute cosine similarity between those word vectors.
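A minimal sketch of that idea, assuming a recent scikit-learn and a small placeholder corpus `docs` (neither is part of the original answer):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus; replace with your own documents
docs = ["the sun is a star", "solar energy comes from the sun",
        "obama was the president of the usa", "the president visited the usa"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                                  # document-term counts
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

word_vectors = lda.components_.T                             # V x K: one topic-weight vector per word
vocab = vec.get_feature_names_out()
word_index = {w: i for i, w in enumerate(vocab)}

def similar_words(query, topn=5):
    q = word_vectors[word_index[query]].reshape(1, -1)
    sims = cosine_similarity(q, word_vectors).ravel()
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if vocab[i] != query][:topn]

print(similar_words("sun"))
```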

0

To answer your question: yes, LDA can be used to return a list of similar words given a query word. Similarity in this case refers to co-occurrence between words, i.e. if u is similar to v, then the probability P(u|v,d) is likely to be high; that is, for any document d, you are likely to see u if you have already seen v.

Such statistical co-occurrences can put words such as 'Obama', 'president' and 'USA' in the same group (an equivalence class defined by the similarity relation).

The exact way you get similar words from LDA is to use the output phi matrix (a K x V matrix, where K = #latent topics and V = #words). Each column vector of this matrix represents a word. Given a query word, take its vector and return a list of words whose vectors are most similar (by inner product) to the query's vector.
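As a rough illustration of ranking by inner product over phi, here is a sketch using gensim; the tokenised corpus and the parameter values are assumptions for the example, not part of the original answer:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenised toy corpus; in practice use your own pre-processed documents
texts = [["sun", "star", "bright", "light"],
         ["solar", "energy", "sun", "heat"],
         ["obama", "president", "usa"],
         ["president", "usa", "election"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, passes=20, random_state=0)

phi = lda.get_topics()                    # K x V matrix of topic-word probabilities
word_vecs = phi.T                         # each row is a word's K-dimensional vector

query_id = dictionary.token2id["sun"]
scores = word_vecs @ word_vecs[query_id]  # inner product with every word in the vocabulary
ranked = np.argsort(-scores)
print([dictionary[i] for i in ranked[:5] if i != query_id])
```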

However, LDA won't be a particularly good choice for capturing synonymy relations between terms, e.g. 'sun' and 'solar'. Word vector embeddings are a particularly good choice in such a scenario.

The main difference between word vectors and LDA is that the notion of similarity used in the former is more contextual. To be more precise, word vectors u and v are similar if they are both similar to their context vectors, i.e. the other words in close proximity around them. Coming back to the example, in the contexts of both 'sun' and 'solar' you expect to see words such as 'star', 'planets', 'energy', 'heat', etc., which all contribute to the belief that 'sun' and 'solar' could be used synonymously.

Also, from a practical viewpoint, word vector embeddings are often a better choice because training is much faster than for LDA. You can use Mikolov's C implementation of word2vec. It comes with a distance utility executable which, given a query word, prints a list of words ranked by decreasing cosine similarity to the query word.
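If you would rather stay in Python, a minimal equivalent of that distance utility can be sketched with gensim's Word2Vec (assuming gensim 4.x; the corpus below is a placeholder):

```python
from gensim.models import Word2Vec

# tokenised toy corpus; replace with your own sentences
sentences = [["the", "sun", "is", "a", "star"],
             ["solar", "energy", "comes", "from", "the", "sun"],
             ["obama", "was", "president", "of", "the", "usa"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50, seed=0)

# words ranked by decreasing cosine similarity to the query, like the C `distance` tool
print(model.wv.most_similar("sun", topn=5))
```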

Debasis