
I thought this may have been discussed before, but somehow I couldn't find answers, so here it is.

Below are the topics generated with gensim's LSI model from a customer survey. My questions are:

  1. What do the minus and plus signs in front of the words mean?
  2. Here I generated 5 topics, and I could have generated more. How do I determine the optimal number of topics? For example, maybe statistically everything after the third topic is trivial.

Any suggestions are appreciated.

0.527*"interest" + 0.475*"lower" + 0.376*"rates" + 0.338*"rate" + 0.324*"good" + 0.257*"service"
0.671*"good" + 0.586*"service" + -0.254*"interest" + -0.251*"lower" + -0.159*"rate" + -0.150*"rates"
0.600*"great" + 0.351*"easy" + 0.337*"rewards" + 0.242*"use" + -0.167*"service" + 0.160*"like"
-0.503*"rates" + 0.499*"rate" + -0.39*"great" + 0.364*"high" + -0.289*"lower" + 0.167*"easy"
-0.608*"great" + 0.362*"easy" + -0.303*"rate" + 0.275*"rates" + 0.244*"use" + -0.227*"high"

Larry H.

1 Answer


The main mechanism behind LSI is singular value decomposition (SVD) of the term-document matrix (TDM). I won't go into much detail here, but you can read about SVD on Wikipedia if you like.

The topics generated are linear combinations of terms. These linear combinations are chosen (using SVD) to create a 'low-rank approximation' of the TDM.
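As a rough illustration of what 'low-rank approximation' means here, the sketch below runs a plain NumPy SVD on a tiny term-document matrix and keeps only the top k components. The matrix, the term list, and the choice k = 2 are all made up for the example, and gensim's own implementation differs in its details (for instance, it works on sparse, streamed input).

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
terms = ["interest", "rates", "good", "service"]
tdm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 2.0],
])

# Thin SVD: tdm = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

k = 2  # number of topics to keep (an arbitrary choice for this sketch)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each column of U[:, :k] is one topic: a signed weight for every term.
for i in range(k):
    print(" + ".join(f'{w:.3f}*"{t}"' for w, t in zip(U[:, i], terms)))

# Frobenius-norm error of the rank-k approximation. It equals the square
# root of the sum of the squared singular values that were dropped.
error = np.linalg.norm(tdm - approx)
print("approximation error:", error)
```

The printed lines have the same shape as the topics in the question: each topic is a weighted sum of terms, and the weights can be positive or negative.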

The magnitude of a word's weight can be thought of as its importance within the topic: how much that word matters in approximating the original TDM. The importance of a topic as a whole in describing the corpus upon which the TDM is based is a separate quantity, given by the topic's singular value.

The signs of the weights are only important relative to one another (you could, for example, multiply everything by -1 and if you correctly re-interpret the linear combinations, you'll arrive at the same interpretation). If each document can be rated on the degree to which it has each topic, then the sign tells you which way the associated word pushes the document. For example, in the output you provided, documents with many appearances of the words 'interest' and 'rates' should be low in the second topic. Documents with many appearances of 'good' and 'service' on the other hand, should be high in the second topic.
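To make the sign interpretation concrete, here is a rough sketch that scores two made-up documents against the second topic from the question. It assumes a document's score is simply the sum of the listed weights over its tokens, and it uses only the six weights shown in the output (the real topic has a weight for every term in the vocabulary). The `topic_score` helper and both documents are hypothetical.

```python
# Weights of the second topic, copied from the output in the question.
topic2 = {
    "good": 0.671, "service": 0.586, "interest": -0.254,
    "lower": -0.251, "rate": -0.159, "rates": -0.150,
}

def topic_score(tokens, weights):
    """Sum the topic weight of every token; unknown words contribute 0."""
    return sum(weights.get(token, 0.0) for token in tokens)

# Two hypothetical, already tokenized survey responses.
doc_service = ["good", "service", "good"]
doc_rates = ["lower", "interest", "rates"]

score_service = topic_score(doc_service, topic2)  # positive: high in topic 2
score_rates = topic_score(doc_rates, topic2)      # negative: low in topic 2

# Flipping every sign mirrors all scores, so the two documents still end up
# on opposite sides of the topic. Only the relative signs carry meaning.
flipped = {word: -weight for word, weight in topic2.items()}
flipped_service = topic_score(doc_service, flipped)
flipped_rates = topic_score(doc_rates, flipped)

print(score_service, score_rates)
print(flipped_service, flipped_rates)
```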

As for determining the optimal number of topics, it is context-specific, but it depends mostly on the size of the corpus. Here are some general guidelines (taken from this answer):

As a general rule, fewer dimensions allow for broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions enable more specific (or more relevant) comparisons of concepts. The actual number of dimensions that can be used is limited by the number of documents in the collection. Research has demonstrated that around 300 dimensions will usually provide the best results with moderate-sized document collections (hundreds of thousands of documents) and perhaps 400 dimensions for larger document collections (millions of documents). However, recent studies indicate that 50-1000 dimensions are suitable depending on the size and nature of the document collection.
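For a small corpus like a single survey, one common heuristic (not part of the guidelines quoted above) is to look at how quickly the singular values fall off and keep only as many topics as are needed to cover most of the total. The sketch below assumes you already have the singular values as an array; the values, the 0.9 threshold, and the `topics_for_energy` helper are all hypothetical. In gensim, I believe a trained model exposes its singular values as `lsi.projection.s`, but check that against your version.

```python
import numpy as np

def topics_for_energy(singular_values, threshold=0.9):
    """Return the smallest k whose top-k squared singular values reach
    `threshold` of the total, plus the cumulative shares themselves."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing k past the number of components.
    return min(k, len(cumulative)), cumulative

# Hypothetical singular values, sorted in descending order.
s = np.array([10.0, 6.0, 3.0, 1.0, 0.5])
k, cumulative = topics_for_energy(s, threshold=0.9)

print("cumulative share:", np.round(cumulative, 3))
print("topics needed for 90%:", k)
```

A sharp drop after the first few values suggests the remaining topics add little. Treat the threshold as a judgment call rather than a statistical test.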

Jason