
I have a list of keywords and another, longer string (2 or 3 pages). I want to figure out which of the keywords are present in that string. E.g.:

keywords = ['k1', 'k2', 'k3 k4', 'k5', 'k6 k7 k8']  # note: some entries are multi-word
paragraphs = "This will be a 2 to 4 page article"

One simple way would be:

present_keywords = [x for x in keywords if x in paragraphs]

The time complexity of the above algorithm will be O(m*n) =~ O(n^2), where m is the number of keywords and n is the length of the paragraphs.

Another way: build an ordered structure from the keyword list in O(n log n) (note that a heap by itself does not support membership lookups, but a sorted list or balanced BST does), then look up each word from the paragraph in O(log n), for O(n log n) overall.
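For illustration, a minimal sketch of that single-word lookup idea using a sorted list and bisect (binary search stands in for the heap, since a plain heap cannot answer membership queries; the sample data here is made up):

import bisect

keywords = ['k1', 'k2', 'k5']               # single-word keywords only
sorted_keywords = sorted(keywords)          # O(n log n) one-time build

present = set()
for word in "an article mentioning k5 somewhere".split():
    i = bisect.bisect_left(sorted_keywords, word)   # O(log n) lookup
    if i < len(sorted_keywords) and sorted_keywords[i] == word:
        present.add(word)

print(present)  # {'k5'}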

Note: the keywords can be bi-grams or tri-grams as well, so this second approach will not work.

What will be an efficient way to achieve this?

Some of the keywords are n-grams

Many people have given solutions without considering this constraint. E.g. "New York" is one keyword; splitting the paragraph into words will turn "New" and "York" into separate words. I have mentioned this in the note above as well.
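To see the problem concretely (a tiny illustrative snippet, not from the original question):

text = "I visited New York last week"
words = text.split()

print('New York' in words)              # False: the bigram was split apart
print('New' in words, 'York' in words)  # True True, but adjacency is lost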

Pramod Patil

1 Answer


To reduce the time complexity, we can increase the space complexity. Go through the keywords and hash them into a set(), assuming every keyword is unique (if not, duplicates will simply be removed).

Then you can go through the paragraph and build one-, two-, and three-word phrases, incrementing a phrase's count whenever it appears in hashedKeywords. Time complexity will be O(m+n) =~ O(n) (more precisely O(m + n*maxGram), with maxGram = 3 here), but space complexity goes from O(1) to O(n).

import string # for removing punctuation

# Sample input with bigrams and trigrams in keywords
paragraphs = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
keywords = ['magna', 'lorem ipsum', 'sed do eiusmod', 'aliqua']

# Hash keywords into a set for O(1) average lookups; lower-case them so
# matching against the lower-cased paragraph words is case-insensitive
hashedKeywords = {keyword.lower() for keyword in keywords}

# Strip punctuation from paragraph phrases using translate() and make it case insensitive using lower()
table = str.maketrans({key: None for key in string.punctuation})
wordsInParagraphs = [w.translate(table).lower() for w in paragraphs.split()]

# Maximum n-gram length to consider, and a dict to count keyword matches
maxGram = 3
wordFrequency = {}

# Loop through the words in the paragraph; at each position, build the one-, two-, and three-word phrases that start there.
for i in range(len(wordsInParagraphs)):
    # List slicing ensures the last word and second to last word will produce a one and two string list, respectively (since slicing past the length of the list will simply return a list up to the last element in Python)
    phrases = wordsInParagraphs[i:i+maxGram] # e.g. ['lorem', 'ipsum', 'dolor']

    # Loop through the one, two, and three word phrases and check if phrase is in keywords
    for j in range(len(phrases)):
        phrase = ' '.join(phrases[0:j+1]) # Join list of strings into a complete string e.g. 'lorem', 'lorem ipsum', and 'lorem ipsum dolor'
        if phrase in hashedKeywords:
            wordFrequency[phrase] = wordFrequency.get(phrase, 0) + 1
print(wordFrequency)

Output:

{'lorem ipsum': 1, 'sed do eiusmod': 1, 'magna': 1, 'aliqua': 1}

Note: this is Python 3. If you are on Python 2 and wish to remove punctuation, see this answer.
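For reference, a minimal Python 2 equivalent of the punctuation stripping (a sketch assuming paragraphs is a byte string; the two-argument form of str.translate() exists only in Python 2):

import string

# Python 2 only: str.translate(table, deletechars) deletes the given characters
wordsInParagraphs = [w.translate(None, string.punctuation).lower()
                     for w in paragraphs.split()]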

Endyd
  • I think this is a funny inversion of the common approach of pre-processing the (pre-known) list of key-words. – greybeard Jan 09 '19 at 21:47
  • @Endyd You missed a comment from the question: the keywords might be n-grams as well, e.g. "New York". Splitting the paragraph will split the keyword into "New" and "York", so the above code will not solve the problem. Even if the counts of "New" and "York" match, they might or might not have appeared consecutively. – Pramod Patil Jan 10 '19 at 07:06
  • Hi, sorry, I missed that, as I didn't quite understand what a bigram was, but with your example I get it: just a two-word keyword. I've edited the answer to include bigrams and trigrams. – Endyd Jan 11 '19 at 13:24
  • @greybeard True haha. In my edited answer to allow bigrams and trigrams, I've done it the better way of hashing keywords (since keywords list is most likely to be smaller than the article, and keyword hash may be used for multiple articles). Thanks for your input. – Endyd Jan 11 '19 at 13:38