I'm trying to build n-grams that don't cross a period symbol. split() only works on strings, and list[index] only accepts an index. Is there a way to split a list at a given element, such as a string? Here is a snippet of my current function:
text = ["split","this","stuff",".","my","dear"]
def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples of words or characters {("this", "is"),("is","an"),...}
    """
    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order
        generated_ngram = rawlist[i : ngram_order_index]
        #if "." in generated_ngram:
            #generated_ngram . . .
        generated_tuple = tuple(generated_ngram)
        list_of_tuples.append(generated_tuple)
    return set(list_of_tuples)
generate_ngram(text,3)
currently returns:
{('.', 'my', 'dear'),
('stuff', '.', 'my'),
('split', 'this', 'stuff'),
('this', 'stuff', '.')}
but it should ideally return:
{('split', 'this', 'stuff'),
('this', 'stuff', '.')}
Any idea on how to achieve this? Thanks for your help!
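One possible approach, as a minimal sketch: first cut the list into sentences at each "." (keeping the period as the last token of its sentence), then build the n-grams inside each sentence separately so that none can cross a period. The helper names split_on_delimiter and generate_ngram_within_sentences, and the delimiter parameter, are made up for illustration and are not part of the function above.

```python
def split_on_delimiter(tokens, delimiter="."):
    """Split a token list into sub-lists, each ending with the delimiter.

    Hypothetical helper: a trailing chunk without a delimiter is kept too.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == delimiter:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def generate_ngram_within_sentences(rawlist, ngram_order, delimiter="."):
    """Return the set of n-grams that never span a delimiter boundary."""
    ngrams = set()
    for sentence in split_on_delimiter(rawlist, delimiter):
        # Sentences shorter than ngram_order produce an empty range.
        for i in range(len(sentence) - ngram_order + 1):
            ngrams.add(tuple(sentence[i:i + ngram_order]))
    return ngrams


text = ["split", "this", "stuff", ".", "my", "dear"]
result = generate_ngram_within_sentences(text, 3)
print(result)
```

With the sample list, the second chunk ["my", "dear"] is shorter than the n-gram order, so it contributes nothing and only the two tuples from the first sentence remain. If the period should not appear inside any n-gram at all, the helper could drop the delimiter instead of appending it.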