
Is there a way in Python to list the documents that belong to a certain topic, for example all documents that are primarily "Topic 0"? I know there are ways to list the topics for each document, but how do I do it the other way around?

Edit:

I am using the following script for LDA:

    doc_set = []
    for file in files:
        newpath = (os.path.join(my_path, file)) 
        newpath1 = textract.process(newpath)
        newpath2 = newpath1.decode("utf-8")
        doc_set.append(newpath2)

    texts = []
    for i in doc_set:
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        stopped_tokens = [i for i in tokens if not i in stopwords.words()]
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
        texts.append(stemmed_tokens)

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word = dictionary, passes=1)
Eisenheim
  • Welcome to StackOverflow! Please take the time to read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and how to provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), and revise your question accordingly. – yatu Sep 07 '20 at 19:26
  • Who silently deleted all my comments? – gojomo Sep 13 '20 at 18:04

1 Answer


You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.

But you want the reverse: a list of documents, for a topic.

Essentially, you'll want to build the reverse-mapping yourself.

Fortunately, Python's native dicts & idioms for working with mappings make this pretty simple - just a few lines of code - as long as you're working with data that fully fits in memory.

Very roughly the approach would be:

  • create a new structure (dict or list) for mapping topics to lists-of-documents
  • iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
  • finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
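
The three steps above can be sketched in plain Python, without any topic-modeling library. This is only a rough illustration: `doc_topics_by_id` is made-up sample data standing in for whatever per-document topic scores your tool returns, and `build_topic_to_docs` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up sample data: doc_id -> list of (topic_id, score) pairs,
# standing in for the per-document output of a topic model.
doc_topics_by_id = {
    "doc_a": [(0, 0.9), (1, 0.1)],
    "doc_b": [(0, 0.3), (1, 0.7)],
    "doc_c": [(0, 0.6), (1, 0.4)],
}

def build_topic_to_docs(doc_topics):
    """Invert a doc -> topics mapping into topic -> [(doc_id, score), ...]."""
    # Step 1: a new structure mapping each topic to a list of docs.
    topic_to_docs = defaultdict(list)
    # Step 2: iterate over all docs, filing each under every topic it has.
    for doc_id, topics in doc_topics.items():
        for topic_id, score in topics:
            topic_to_docs[topic_id].append((doc_id, score))
    # Step 3: sort each topic's docs so the highest-scoring ones come first.
    for docs in topic_to_docs.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(topic_to_docs)

topic_to_docs = build_topic_to_docs(doc_topics_by_id)
print(topic_to_docs[0])
```

A dict keyed by topic works even when topic IDs are not consecutive integers; a plain list indexed by topic ID is enough when they are.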

If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.

Update for your code update:

OK, if your model is in `ldamodel` and your BOW-formatted docs are in `corpus`, you'd do something like:

    # setup: create an empty list per topic to collect the docs
    # (num_topics covers every topic; print_topics() shows at most 20 by default)
    docs_per_topic = [[] for _ in range(ldamodel.num_topics)]

    # now, for every doc...
    for doc_id, doc_bow in enumerate(corpus):
        # ...get its topics...
        doc_topics = ldamodel.get_document_topics(doc_bow)
        # ...& for each of its topics...
        for topic_id, score in doc_topics:
            # ...add the doc_id & its score to the topic's doc list
            docs_per_topic[topic_id].append((doc_id, score))

After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):

    print(docs_per_topic[0])

If you're interested in the top docs per topic, you can further sort each list's pairs by their score:

    for doc_list in docs_per_topic:
        doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)

Then, you could get the top-10 docs for topic 0 like:

    print(docs_per_topic[0][:10])
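
If "primarily Topic 0" means that topic 0 is the single highest-scoring topic for a document, one option is to file each doc only under its top topic. A minimal sketch, assuming each entry of the hypothetical `doc_topics_list` has the same `(topic_id, score)` shape that `get_document_topics` returns; the sample scores and the name `primary_docs` are made up for illustration:

```python
# Made-up per-document topic scores; the position in the list is the doc_id.
doc_topics_list = [
    [(0, 0.85), (1, 0.15)],
    [(0, 0.20), (1, 0.80)],
    [(0, 0.55), (1, 0.45)],
]
num_topics = 2  # assumed to match the model's topic count

# One list per topic, holding only the docs whose top topic it is.
primary_docs = [[] for _ in range(num_topics)]
for doc_id, doc_topics in enumerate(doc_topics_list):
    if not doc_topics:
        continue  # a doc with no reported topics is skipped
    # Pick the (topic_id, score) pair with the highest score.
    topic_id, score = max(doc_topics, key=lambda pair: pair[1])
    primary_docs[topic_id].append((doc_id, score))

print(primary_docs[0])
```

With this variant every doc appears in exactly one topic's list, rather than in every topic it has some score for.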

Note that this does everything using all-in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
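
As one possible shape for a database-backed version, here is a minimal sketch using Python's standard `sqlite3` module. The table name `doc_topics`, the helper names `iter_topic_rows` and `top_docs`, and the sample rows are all assumptions for illustration; in real use the rows would come from looping over the corpus, and the connection would point at a file rather than `:memory:`.

```python
import sqlite3

def iter_topic_rows():
    """Yield made-up (topic_id, doc_id, score) rows one at a time.

    In real use this would loop over the corpus and yield each doc's
    topic scores, so the full listing never has to sit in memory.
    """
    yield (0, 0, 0.85)
    yield (1, 0, 0.15)
    yield (0, 1, 0.20)
    yield (1, 1, 0.80)
    yield (0, 2, 0.55)
    yield (1, 2, 0.45)

# ":memory:" keeps this example self-contained; pass a file path instead
# to get an on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_topics (topic_id INTEGER, doc_id INTEGER, score REAL)"
)
# The index lets per-topic, score-ordered lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_topic_score ON doc_topics (topic_id, score)")
conn.executemany("INSERT INTO doc_topics VALUES (?, ?, ?)", iter_topic_rows())
conn.commit()

def top_docs(conn, topic_id, limit=10):
    """Return up to `limit` (doc_id, score) pairs for a topic, best first."""
    cur = conn.execute(
        "SELECT doc_id, score FROM doc_topics "
        "WHERE topic_id = ? ORDER BY score DESC LIMIT ?",
        (topic_id, limit),
    )
    return cur.fetchall()

print(top_docs(conn, 0))
```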

gojomo
  • I have edited my question to add the script I am using to run gensim LDA. Would you be able to look at it and suggest the code I can use? Thanks a lot. – Eisenheim Sep 13 '20 at 12:19
  • Thank you very much, it works! Just a little problem though: I am unable to trace back to the documents in my folder using the document IDs. It looks like gensim does not assign document IDs in the same order as the documents in my folder. I have tried re-arranging the documents in my folder by name/type/added/modified, but it still doesn't line up with gensim. – Eisenheim Sep 14 '20 at 10:31
  • If your documents are only identified by their filename/file-path, you will need to remember your own mapping of `doc_id` to original filepath. For example, you could extend your code to first create another `list` at the top, like `id_to_path = []`. Then, at the bottom inside your `for file in files:` loop, remember the file-paths in the same order as the docs are created, with `id_to_path.append(newpath)`. Then, at the end, you can look up any `doc_id` inside `id_to_path` to find the original file. – gojomo Sep 14 '20 at 18:04
  • Thank you. I have done what you said, but I am struggling to figure out how to look up a doc ID and output the corresponding path/filename. – Eisenheim Sep 14 '20 at 19:50
  • What should I do when my data does not fit in memory? This is my case, unfortunately. – Nina van Bruggen Jul 05 '22 at 14:14
  • If you have a question about a different topic, like ways to perform certain operations on a corpus that does not fit into memory, you should post that as a separate question, with full details about the size of your data, your system limits, and which specific operations you want to perform. – gojomo Jul 05 '22 at 20:36
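
For the `id_to_path` lookup discussed in the comments above, a minimal sketch might look like the following. The folder name, file names, and `docs_for_topic` pairs are made up for illustration, and the text-extraction step from the question's loop is left out so the example runs on its own.

```python
import os

my_path = "docs"                      # hypothetical folder
files = ["b.pdf", "a.docx", "c.txt"]  # hypothetical names, in processing order

# Record each file's path in the same order the docs are created, so that
# position N in this list is the path of the doc with doc_id N.
id_to_path = []
for file in files:
    newpath = os.path.join(my_path, file)
    # ... text extraction and doc_set.append(...) would happen here ...
    id_to_path.append(newpath)

# Made-up (doc_id, score) pairs, in the shape of one topic's doc list.
docs_for_topic = [(2, 0.91), (0, 0.64)]
for doc_id, score in docs_for_topic:
    # Indexing by doc_id recovers the original path; basename gives the file name.
    print(doc_id, os.path.basename(id_to_path[doc_id]), score)
```

The doc IDs follow the order in which the loop visits `files`, not the order a file browser shows, which is why the list has to be filled inside that same loop.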