As stated in the title, I'm trying to calculate how often each phrase from a given list of sequences appears in a list of strings. The problem is that the words of a phrase do not have to be adjacent; there may be one or more words in between.

Example:

Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234" 

I remove the stopwords (NLTK stopwords), remove punctuation, lowercase all letters and tokenize the sentence, so the processed sentence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences varying in length from 1 (single words) to 3, and I'm searching in almost 6,000 short sentences. My current approach is presented below:

from collections import Counter
from tqdm import tqdm
import itertools
import nltk

# Get the term frequency per sentence
def get_bow(sen, vocab):

    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    # All ordered combinations of 1, 2 or 3 tokens, so gaps between words are allowed
    combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                                   itertools.combinations(tokenized_sentence, 2),
                                                   itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sentence_vectors.append(get_bow(sentence, phrase_list))

phrase_list is a list of tuples with the sequences, text_list is a list of strings. Currently, the frequency takes over 1 hour to calculate, and I'm trying to find a more efficient way to get the list of frequencies associated with the given terms. I've also tried using sklearn's CountVectorizer, but it does not handle sequences with gaps, so they are not counted at all.

I'd be grateful for any insight into how to make my script more efficient. Thanks in advance!

EDIT:

Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]

Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']

Expected output: [2, 0, 0, 1, 0] - a vector with the occurrence count of each phrase; the order of values should be the same as in phrase_list. My code returns a vector of phrase occurrences per sentence, because I was trying to implement something like a bag-of-words.

maelstro

1 Answer

There are many aspects that could be made faster, but here is the main problem:

combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                               itertools.combinations(tokenized_sentence, 2),
                                               itertools.combinations(tokenized_sentence, 3)]))

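You generate all possible combinations of 1, 2 or 3 words of the sentence. The number of 3-word combinations grows with the cube of the sentence length, and for every combination you then scan the whole vocab list (el in vocab), plus combined_sentence.count(el) and vocab.index(el) for each match. This is slow no matter what you want to do.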
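To get a feeling for the growth, here is a small sketch that counts how many tuples end up in combined_sentence for a sentence with a given number of tokens. The helper name n_combinations is made up for this illustration, and math.comb needs Python 3.8 or newer.

```python
from math import comb

def n_combinations(n_tokens, max_len=3):
    """Number of tuples produced by combinations of length 1..max_len."""
    return sum(comb(n_tokens, k) for k in range(1, max_len + 1))

# The first example sentence in text_list has 18 tokens.
print(n_combinations(18))
# A hypothetical 40-token sentence, for comparison.
print(n_combinations(40))
```

Each of those tuples is then checked against a list of about 30,000 phrases, which is where most of the time goes.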

Sentence: "Master Yoda about sentence structure care does not."

  1. If you really want to treat this sentence as if it contained "Yoda does not", then you should still not generate all combinations. There are much faster ways, but I will only spend time on this if that is indeed your goal.
  2. If you want to treat this sentence as one that does NOT contain "Yoda does not", then I think you can figure out yourself how to speed up your code. Maybe look here.

I hope this helped. Let me know in case you need option 1.
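In case it is useful, here is a minimal sketch of one way option 1 could be done without generating combinations. It is not the only way and I have not timed it on your data. The assumptions: the sentences are already preprocessed, so a plain str.split() stands in for nltk.word_tokenize; every phrase has at least one word; and the function names count_gapped and phrase_frequencies are made up for this example. The idea is to build an inverted index (word to sentence numbers), look only at sentences that contain every word of a phrase, and count the ordered matches with a small dynamic program. The count is the same one your combinations approach produces for a single sentence.

```python
from collections import defaultdict

def count_gapped(phrase, tokens):
    """Count ordered occurrences of phrase in tokens, gaps allowed."""
    # ways[j] = number of ways to match the first j words of phrase so far
    ways = [1] + [0] * len(phrase)
    for tok in tokens:
        # Go right to left so one token never fills two positions at once.
        for j in range(len(phrase), 0, -1):
            if phrase[j - 1] == tok:
                ways[j] += ways[j - 1]
    return ways[-1]

def phrase_frequencies(phrase_list, text_list):
    """Total count of each phrase over all sentences, in phrase_list order."""
    # Assumption: text is already cleaned, so whitespace splitting is enough.
    tokenized = [sentence.split() for sentence in text_list]

    # Inverted index: word -> set of sentence numbers containing it.
    index = defaultdict(set)
    for i, tokens in enumerate(tokenized):
        for tok in tokens:
            index[tok].add(i)

    totals = []
    for phrase in phrase_list:
        # Only sentences containing every word of the phrase can match.
        candidates = set.intersection(*(index.get(w, set()) for w in phrase))
        totals.append(sum(count_gapped(phrase, tokenized[i]) for i in candidates))
    return totals

phrase_list = [('able',), ('able', 'us', 'software'), ('able', 'back'),
               ('printer', 'holidays'), ('printer', 'information')]
text_list = ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look',
             'absolutely one cat coyote peterson',
             'accurate customs checks cause delays also causing issues expected delivery dates changing',
             'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get',
             'additional information though pass comments team thanks']

print(phrase_frequencies(phrase_list, text_list))  # [2, 0, 0, 1, 0]
```

If you need one vector per sentence instead of the totals, keep the per-sentence counts from count_gapped rather than summing them.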

KaPy3141