
I have several gensim models fit to ~5 million documents. I'd like to pull the top 100 most representative documents from each model for each topic, to help me pick the best model.

Let's say I have a model lda and a corpus corpus. I can get the topic probabilities in the following form:

topic_probs = lda[corpus]

where topic_probs contains, for each document, a list of (topic_num, topic_prob) tuples.

How can I sort this list of tuples by topic, and then probability, then retrieve the top 100 documents from the corpus? I'm guessing the answer looks something like the method for assigning topics here, but I'm struggling with how to work with a list of tuples while maintaining the document indices.

(This is somewhat complicated by the fact that I didn't know about the minimum_probability argument to gensim.LdaModel, so topics with < 0.01 probability are suppressed. These models take 2-3 days to run each, so I'd like to avoid re-running them if possible).
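To make the desired operation concrete, here is a pure-Python sketch of grouping the per-document tuples by topic and sorting by probability while keeping document indices. The helper name top_docs_per_topic is hypothetical, and the input is assumed to be one (topic, prob) list per document, in corpus order:

```python
from collections import defaultdict

def top_docs_per_topic(topic_probs, n=100):
    """Return {topic_id: [(doc_index, prob), ...]} with the n
    highest-probability documents per topic. topic_probs is assumed
    to hold one list of (topic_id, prob) tuples per document."""
    by_topic = defaultdict(list)
    # tag each probability with its document index before regrouping
    for doc_idx, doc_topics in enumerate(topic_probs):
        for topic_id, prob in doc_topics:
            by_topic[topic_id].append((doc_idx, prob))
    # sort each topic's documents by probability, descending, and truncate
    return {
        t: sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
        for t, pairs in by_topic.items()
    }
```

Topics suppressed by minimum_probability simply never appear in a document's tuple list, so they contribute nothing to that document's candidacy.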

Sean Norton

2 Answers


I've managed to figure this out, though possibly not in the most efficient way, and I didn't have time to make it fully portable.

  1. I extracted the by-document topic probabilities and dumped them to csv (my corpus is large and takes forever to process, so I wanted these saved). You can easily loop this over multiple models.
import pandas as pd

def lda_to_csv(model, outfile, corpus):
    '''Take a gensim lda model and write topic probabilities by document to a csv.'''
    topic_probs = model.get_document_topics(corpus) #list of (topic, prob) tuples per doc
    topic_dict = [dict(x) for x in topic_probs] #convert tuples to dicts for the data frame
    df = pd.DataFrame(topic_dict).fillna(0) #topics suppressed as < 0.01 become 0
    df.to_csv(outfile)
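To illustrate the reshaping step without gensim, here is the same dict-to-data-frame conversion on a toy stand-in for get_document_topics output (the probability values are made up):

```python
import pandas as pd

# Toy stand-in for model.get_document_topics(corpus):
# one list of (topic, prob) tuples per document
topic_probs = [[(0, 0.9), (1, 0.1)], [(1, 0.8)], [(0, 0.3), (2, 0.7)]]

topic_dict = [dict(x) for x in topic_probs]  # {topic: prob} per document
df = pd.DataFrame(topic_dict).fillna(0)      # wide frame; missing topics -> 0
```

Each row is a document and each column a topic; topics that gensim suppressed for a document show up as NaN and are filled with 0.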
  2. I wrote a function that selects the n largest probabilities for each topic, then extracts and returns the documents, given the original (not BoW) texts as a list. This assumes the column with document indices is named "docs". The data frame is the one created by the previous function, read back in from csv.
def get_best_docs(df, n, k, texts):
    '''Return the n most representative documents for each of the k topics.
    n is the number of documents you want, k is the number of topics in the
    model, and texts are the FULL texts used to fit the model.'''
    #create column list to iterate over
    k_cols = range(0, k)

    #initialize empty list to hold results
    n_rep_docs = []

    #loop to extract documents for each topic
    #topic columns are strings after the csv round-trip, hence str(i)
    for i in k_cols:
        inds = df.nlargest(n = n, columns = str(i))['docs'].astype(int).tolist()
        #use list comprehension to extract documents
        n_rep_docs.append([texts[ind] for ind in inds])

    return n_rep_docs
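For example, the selection step for a single topic works like this on a toy frame (string column names mimic the csv round-trip; the values are made up):

```python
import pandas as pd

texts = ["doc a", "doc b", "doc c", "doc d"]
df = pd.DataFrame({
    "docs": [0, 1, 2, 3],       # original corpus indices
    "0": [0.9, 0.1, 0.6, 0.2],  # topic-0 probabilities
    "1": [0.1, 0.8, 0.0, 0.7],  # topic-1 probabilities
})

# Top-2 documents for topic 0, mirroring the loop body above
inds = df.nlargest(n=2, columns="0")["docs"].astype(int).tolist()
best = [texts[i] for i in inds]
```

Keeping a dedicated "docs" index column is what preserves the mapping back to the full texts after nlargest reorders the rows.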
Sean Norton

I just want to alert others who may find themselves on this page that I believe the code posted previously is incorrect. I haven't had time to locate the exact source of the error, but I used it in my work for a few weeks and obtained very jumbled results. The documents assigned as "top candidates" for each topic were incoherent and did not correspond with the topic's representative words. I recently started using this code instead and have obtained much more coherent results.

lapeliroja