4

I have a set of 3000 text documents and I want to extract top 300 keywords (could be single word or multiple words).

I have tried the approaches below:

RAKE: It is a Python-based keyword extraction library, and it failed miserably.

Tf-Idf: It has given me good keywords per document, but it is not able to aggregate them and find keywords that represent the whole group of documents. Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?

Word2vec: I was able to do some cool stuff like finding similar words, but I'm not sure how to find important keywords using it.

Can you please suggest a good approach (or elaborate on how to improve any of the above three) to solve this problem? Thanks :)

Vini

4 Answers

4

Although latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP) are typically used to derive topics within a text corpus, with those topics then used to classify individual entries, a method to derive keywords for the entire corpus can also be built on them. This method benefits from not relying on another text corpus. A basic workflow would be to compare these Dirichlet keywords to the most common words in the corpus to see whether LDA or HDP is able to pick up on important words that are not included among the most common ones.

Before using the following code, it is generally suggested that the following text preprocessing is done:

  1. Remove punctuation from the texts (see string.punctuation)
  2. Convert the string texts to lowercase "tokens" (str.lower().split() to get individual words)
  3. Remove numbers and stop words (see stopwordsiso or stop_words)
  4. Create bigrams - combinations of words in the text that appear together often (see gensim.Phrases)
  5. Lemmatize tokens - converting words to their base forms (see spacy or NLTK)
  6. Remove tokens that aren't frequent enough (or are too frequent, but in this case skip removing the overly frequent ones, as these would make good keywords)

These steps would create the variable corpus used in what follows; a rough preprocessing sketch is given just below. A good overview of all this with an explanation of LDA can be found here.
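As an illustration only, a minimal sketch of steps 1-6 could look like the following. It is assumption-laden rather than part of the original workflow: it assumes the raw texts are in a list of strings called documents, that NLTK's stopword list has been downloaded, and that spaCy's en_core_web_sm model is installed (the stopwordsiso or stop_words packages mentioned above could be swapped in):

import string
from collections import Counter

import spacy
from gensim.models import Phrases
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

corpus = []
for doc in documents:  # documents: list of raw strings (assumed)
    # Steps 1-3: remove punctuation, lowercase and tokenize, drop numbers and stop words
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    tokens = [t for t in doc.lower().split()
              if t not in stop_words and not t.isdigit()]
    # Step 5 (done before step 4 here so that lemmas feed the bigram model)
    tokens = [t.lemma_ for t in nlp(' '.join(tokens))]
    corpus.append(tokens)

# Step 4: join word pairs that frequently appear together into bigrams
bigram = Phrases(corpus, min_count=5, threshold=10)
corpus = [bigram[text] for text in corpus]

# Step 6: remove tokens that appear only once (keep the frequent ones)
counts = Counter(t for text in corpus for t in text)
corpus = [[t for t in text if counts[t] > 1] for text in corpus]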

Now for LDA and HDP with gensim:

from gensim.models import LdaModel, HdpModel
from gensim import corpora

First create a Dirichlet dictionary that maps the words in corpus to indexes, and then use it to create a bag-of-words representation in which the tokens within corpus are replaced by their indexes. This is done via:

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

For LDA, the optimal number of topics needs to be derived, which can be done heuristically through the method in this answer. Let's assume that our optimal number of topics is 10, and, as per the question, that we want 300 keywords:

num_topics = 10
num_keywords = 300
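As a rough illustration of such a heuristic (a sketch, not the linked answer's exact method: it assumes the corpus, bow_corpus, and dirichlet_dict variables from this answer and uses gensim's CoherenceModel with c_v coherence, which is one common choice among several), the topic count could be chosen by training a few models and keeping the most coherent one:

from gensim.models import CoherenceModel

coherence_per_k = {}
for k in range(5, 30, 5):  # candidate topic counts to try (arbitrary range)
    model = LdaModel(corpus=bow_corpus, id2word=dirichlet_dict,
                     num_topics=k, passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=corpus,
                        dictionary=dirichlet_dict, coherence='c_v')
    coherence_per_k[k] = cm.get_coherence()

best_k = max(coherence_per_k, key=coherence_per_k.get)  # would be used as num_topics above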

Create an LDA model:

dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

Next comes a function to derive the best topics based on their average coherence across the corpus. First, an ordered list of the most important words per topic is produced; then the average coherence of each topic across the whole corpus is found; and finally the topics are ordered by this average coherence and returned along with a list of the averages for later use. The code for all this is as follows (it includes the option to use HDP, described below):

def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.type_of_model
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics, ordered_topic_averages: list of lists and list
    """
    if isinstance(dirichlet_model, LdaModel):
        shown_topics = dirichlet_model.show_topics(num_topics=num_topics, 
                                                   num_words=num_keywords,
                                                   formatted=False)
    elif isinstance(dirichlet_model, HdpModel):
        shown_topics = dirichlet_model.show_topics(num_topics=150, # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) \
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]

    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # limit for HDP

    ordered_topic_averages = [topic_averages[i] for i in topic_indexes_by_avg_coherence][:num_topics] # limit for HDP
    ordered_topic_averages = [a/sum(ordered_topic_averages) for a in ordered_topic_averages] # normalize HDP values

    return ordered_topics, ordered_topic_averages

Now to get a list of keywords - the most important words across the topics. This is done by subsetting the words (which again are ordered by significance by default) from each of the ordered topics based on their average coherence to the whole corpus. To explain explicitly, assume that there are just two topics, and that the texts are 70% coherent to the first and 30% coherent to the second. The keywords could then be the top 70% of words from the first topic, and the top 30% of words from the second that have not already been selected. This is achieved via the following:

ordered_topics, ordered_topic_averages = \
    order_subset_by_coherence(dirichlet_model=dirichlet_model,
                              bow_corpus=bow_corpus, 
                              num_topics=num_topics,
                              num_keywords=num_keywords)

keywords = []
for i in range(num_topics):
    # Find the number of indexes to select, which can later be extended if the word has already been selected
    selection_indexes = list(range(int(round(num_keywords * ordered_topic_averages[i]))))
    if selection_indexes == [] and len(keywords) < num_keywords: 
        # Fix potential rounding error by giving this topic one selection
        selection_indexes = [0]
              
    for s_i in selection_indexes:
        if ordered_topics[i][s_i] not in keywords and ordered_topics[i][s_i] not in ignore_words:
            keywords.append(ordered_topics[i][s_i])
        else:
            selection_indexes.append(selection_indexes[-1] + 1)

# Fix for if too many were selected
keywords = keywords[:num_keywords]

The above also includes the variable ignore_words, which is a list of words that should not be included in the results.
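For example, ignore_words can start out empty and be filled in after inspecting a first pass of results (the commented-out words below are purely hypothetical):

ignore_words = []  # define before the selection loop above
# after reviewing an initial keyword list, e.g.:
# ignore_words = ['company', 'team']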

For HDP the model follows a process similar to the above, except that num_topics and other arguments do not need to be passed in model creation. HDP derives optimal topics itself, but these topics then need to be ordered and subsetted using order_subset_by_coherence to ensure that the best topics are used for a finite selection. A model is created via:

dirichlet_model = HdpModel(corpus=bow_corpus, 
                           id2word=dirichlet_dict,
                           chunksize=len(bow_corpus))

It is best to test both LDA and HDP, as LDA can outperform HDP if a suitable number of topics can be found for the problem at hand (it is still the standard over HDP). Compare the Dirichlet keywords to word frequencies alone; hopefully what is generated is a list of keywords that are more related to the overall theme of the texts, not simply the words that are most common.
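One rough way to make that comparison (a sketch that assumes the corpus, keywords, and num_keywords variables from above) is to check the overlap between the Dirichlet keywords and the most frequent tokens:

from collections import Counter

token_counts = Counter(t for text in corpus for t in text)
most_frequent = [w for w, _ in token_counts.most_common(num_keywords)]

overlap = set(keywords) & set(most_frequent)
print(f"{len(overlap)} of the {len(keywords)} Dirichlet keywords are also top-frequency words")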

Obviously, selecting ordered words from topics based on percent text coherence doesn't give an overall ordering of the keywords by importance, as some words that are very important within topics of lower overall coherence will be selected later.

The process for using LDA to generate keywords for the individual texts within the corpus can be found in this answer.

0

It's better for you to choose those 300 words manually (it's not so many, and it's a one-time task) - code written in Python 3:

import os

files = os.listdir()
topWords = ["word1", "word2.... etc"]  # the manually chosen keywords
wordsCount = 0

for file in files:
    with open(file, "r") as file_opened:
        text = file_opened.read()
    for word in topWords:
        if word in text and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Check wordsCount again to break out of the outer loop
    if wordsCount >= 300:
        break
ricristian
  • this answer does not answer the question "extract automatically". It's quite time-consuming to read 3000 documents and extract keywords individually. – Luca Foppiano Aug 30 '21 at 01:39
  • Well, true indeed, but as I already mentioned, if it is a one-time action I don't believe it matters much whether the script takes 1 second or 1 minute ... And if my answer doesn't really help ... I can delete it. Would this be OK with you @LucaFoppiano? Thanks – ricristian Sep 05 '21 at 10:02
  • I think there are several problems with your answer, because knowing the 300 words is the difficult task that is not known in advance. It's not clear what your script is actually trying to do ;-) because the topWords are already known. – Luca Foppiano Sep 07 '21 at 02:06
-1
import os
import operator
from collections import defaultdict

files = os.listdir()
words = defaultdict(int)

for file in files:
    with open(file, "r") as open_file:
        for line in open_file:
            for word in line.split():
                words[word] += 1

# Sort by frequency, highest counts first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 entries from sorted_words; they are the words you want.
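For example, with the sort above in descending order, that is just the first 300 entries:

top_300 = [word for word, count in sorted_words[:300]]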

Awaish Kumar
  • Thanks @Awaish, but I have tried this also. The results were very poor with this approach because the important terms only appear once or twice. If I try to sort and select Tf-idf terms based on frequency, a lot of common and irrelevant terms come up. – Vini Aug 28 '17 at 05:07
  • This solution implies that you already know the words that you are looking for. – Luca Foppiano Aug 30 '21 at 01:38
-1

The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have stop words, you can filter them out before applying this code. Hope this works for you.

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  //to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf (inverse document frequency) score of term termToCheck
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
shiv
  • Thanks @shiv. But I have already implemented Tf-Idf and I did it with Lucene (for faster processing). The problem is Tf-Idf gives you "important terms" per document and not over the whole set of documents. – Vini Aug 28 '17 at 05:03