I have several gensim models fit to ~5 million documents. I'd like to pull the top 100 most representative documents for each topic from each model, to help me pick the best model.
Let's say I have a model named lda and a corpus named corpus. I can get the topic probabilities in the following form:
topic_probs = lda[corpus]
Iterating over topic_probs yields one entry per document, and each entry is a list of tuples: (topic_num, topic_prob).
How can I sort this list of tuples by topic, and then probability, then retrieve the top 100 documents from the corpus? I'm guessing the answer looks something like the method for assigning topics here, but I'm struggling with how to work with a list of tuples while maintaining the document indices.
(This is somewhat complicated by the fact that I didn't know about the minimum_probability argument to gensim.LdaModel, so topics with < 0.01 probability are suppressed. These models take 2-3 days each to run, so I'd like to avoid re-running them if possible.)
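To make the goal concrete, here is a minimal sketch of the kind of result I'm after. It assumes topic_probs can be iterated once in corpus order, so the position in the stream serves as the document index. The function name top_docs_per_topic and the toy data are hypothetical, made up for illustration. It keeps a small heap per topic, so it never holds all 5 million documents' probabilities in memory at once.

```python
import heapq
from collections import defaultdict


def top_docs_per_topic(topic_probs, n=100):
    """Return {topic_num: [(prob, doc_index), ...]}, best first, at most n per topic.

    topic_probs is any iterable that yields, per document, a list of
    (topic_num, topic_prob) tuples. Topics missing from a document's list
    (for example, suppressed below a probability threshold) are skipped.
    """
    heaps = defaultdict(list)  # topic_num -> min-heap of (prob, doc_index)
    for doc_index, doc_topics in enumerate(topic_probs):
        for topic_num, prob in doc_topics:
            heap = heaps[topic_num]
            item = (float(prob), doc_index)
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # The new document beats the weakest one currently kept.
                heapq.heapreplace(heap, item)
    return {topic: sorted(heap, reverse=True) for topic, heap in heaps.items()}


# Toy stand-in for lda[corpus]: four documents, two topics.
toy_topic_probs = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(1, 0.95)],
    [(0, 0.6), (1, 0.4)],
]
toy_top = top_docs_per_topic(toy_topic_probs, n=2)
```

With the real model this would presumably be called as top_docs_per_topic(lda[corpus]), and the returned doc_index values used to look the documents up in the corpus. Since only the top 100 per topic matter, the suppressed low-probability topics should not change the result unless a topic has fewer than 100 documents above the threshold.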