To reduce the time complexity, we can increase the space complexity. Go through keywords
and hash them into a set(), assuming every keyword is unique (if not, duplicates will just be removed).
Then you can go through the paragraph
and create one, two, or three word phrases, checking for their presence and incrementing their counts as any of those phrases appear in the hashedKeywords
. Time complexity will be O(m+n) =~ O(n) but space complexity goes from O(1) to O(n).
import string # for removing punctuation
# Sample input with bigrams and trigrams in keywords
paragraphs = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
keywords = ['magna', 'lorem ipsum', 'sed do eiusmod', 'aliqua']
# Hash keywords into set for faster look up
hashedKeywords = set()
for keyword in keywords:
hashedKeywords.add(keyword)
# Strip punctuation from paragraph phrases using translate() and make it case insensitive using lower()
table = str.maketrans({key: None for key in string.punctuation})
wordsInParagraphs = [w.translate(table).lower() for w in paragraphs.split()]
# Initialize for loop
maxGram = 3
wordFrequency = {}
# Loop through words in paragraphs but also create a small list of one, two, or three word phrases.
for i in range(len(wordsInParagraphs)):
# List slicing ensures the last word and second to last word will produce a one and two string list, respectively (since slicing past the length of the list will simply return a list up to the last element in Python)
phrases = wordsInParagraphs[i:i+maxGram] # e.g. ['lorem', 'ipsum', 'dolor']
# Loop through the one, two, and three word phrases and check if phrase is in keywords
for j in range(len(phrases)):
phrase = ' '.join(phrases[0:j+1]) # Join list of strings into a complete string e.g. 'lorem', 'lorem ipsum', and 'lorem ipsum dolor'
if phrase in hashedKeywords:
wordFrequency.setdefault(phrase , 0)
wordFrequency[phrase] += 1
print(wordFrequency)
Output:
{'lorem ipsum': 1, 'sed do eiusmod': 1, 'magna': 1, 'aliqua': 1}
Note: This is in Python 3. If in Python 2 and wish to remove punctuation, see this answer.