I'm trying to build n-grams which don't cross a period symbol. split() only works on strings, and list[index] only works with a numeric index. Is there a way to access/split/divide a list at a given element, such as a string? Here is a snippet of my current function:

text = ["split","this","stuff",".","my","dear"]

def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples of words or characters {("this", "is"),("is","an"),...}
    """

    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order    
        generated_ngram = rawlist[i : ngram_order_index]

        #if "." in generated_ngram:
            #generated_ngram . . . 

        generated_tuple = tuple(generated_ngram)  
        list_of_tuples.append(generated_tuple)

    return set(list_of_tuples)

generate_ngram(text,3)

currently returns:

{('.', 'my', 'dear'),
 ('stuff', '.', 'my'),
 ('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

but it should ideally return:

{('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

Any idea on how to achieve this? Thanks for your help!

Lisa
  • Your output contains many words that do not appear in the list. – Sociopath Feb 26 '19 at 12:25
  • Please review your examples and try to explain a bit further what you want it to do. The documentation in the function seems to suggest you are trying to build n-grams. However, the outputs that you say you expect have different sizes. Do you want to build n-grams that do not cross a period symbol? – jdehesa Feb 26 '19 at 12:30
  • @jdehesa, thank you for your recommendations. I tried to adapt my documentation. Sorry, first time posting here! Yes, I indeed mean building n-grams that don't cross a period symbol/sentence border. – Lisa Feb 26 '19 at 12:35

1 Answer

I'm not sure if this is exactly what you need, but this function generates n-grams that can only contain a stop word (in this case, the period) as their last element:

STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')

Note that this function returns a generator. You can convert it to a list by wrapping it in list(...) if you want, or you can iterate over it directly.
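
Since the function in the question returned a set, here is a small sketch of getting the same kind of result from the generator version. The names gen, ngram_set and leftover are just for illustration. It also shows that a generator can only be consumed once, so store the result if you need it more than once.

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams, built from shifted slices of the list
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Keep only those ngrams with no stop word before the last position
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

gen = generate_ngram(text, 3)
ngram_set = set(gen)   # consumes the generator and drops duplicates
leftover = list(gen)   # the generator is exhausted now, so this is empty

print(ngram_set == {("split", "this", "stuff"), ("this", "stuff", ".")})  # True
print(leftover)  # []
```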

EDIT: You may find the equivalent syntax below more readable.

def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
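
If you would rather literally split the list at each period, as the question asks, a possible alternative is to cut the list into sentences first and then build n-grams inside each sentence. This is only a sketch: split_sentences and ngrams_per_sentence are names I made up, and it assumes each period should stay attached to the end of its sentence. For the example input it should give the same n-grams as the function above.

```python
def split_sentences(words, stopwords=frozenset({"."})):
    # Collect words until a stop word is seen, then emit the sentence
    sentence = []
    for word in words:
        sentence.append(word)
        if word in stopwords:
            yield sentence
            sentence = []
    # Emit any trailing words that were not closed by a stop word
    if sentence:
        yield sentence

def ngrams_per_sentence(words, ngram_order, stopwords=frozenset({"."})):
    # Build ngrams inside each sentence, so none can cross a stop word
    for sentence in split_sentences(words, stopwords):
        for i in range(len(sentence) - ngram_order + 1):
            yield tuple(sentence[i:i + ngram_order])

text = ["split", "this", "stuff", ".", "my", "dear"]
print(list(split_sentences(text)))
# [['split', 'this', 'stuff', '.'], ['my', 'dear']]
print(list(ngrams_per_sentence(text, 3)))
# [('split', 'this', 'stuff'), ('this', 'stuff', '.')]
```
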
jdehesa
  • That's precisely what I needed! Thank you so much. – Lisa Feb 26 '19 at 12:44
  • @Lisa Glad it helped. I have added a syntax variation that you may find more readable. Please consider marking the answer as accepted if you feel it solved your question. Note, by the way, that this method assumes the input is a sequence, like a list or a tuple. If it is another kind of iterable, like a generator, then `zip(*(rawlist[i:] for i in range(ngram_order)))` will not work; see [Rolling or sliding window iterator?](https://stackoverflow.com/q/6822725) for alternatives to that line. – jdehesa Feb 26 '19 at 12:53
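
For inputs that cannot be sliced, one possible sliding-window variant keeps the last few words in a collections.deque. This is a sketch rather than the code from the linked question; generate_ngram_iter is a made-up name, and it assumes ngram_order is at least 1.

```python
from collections import deque

def generate_ngram_iter(iterable, ngram_order, stopwords=frozenset({"."})):
    # A deque with maxlen drops the oldest word automatically on append
    window = deque(maxlen=ngram_order)
    for word in iterable:
        window.append(word)
        if len(window) == ngram_order:
            ngram = tuple(window)
            # Same filter as before: no stop word before the last position
            if not any(w in stopwords for w in ngram[:-1]):
                yield ngram

# Works with a one-shot iterator, which does not support slicing
words = iter(["split", "this", "stuff", ".", "my", "dear"])
print(list(generate_ngram_iter(words, 3)))
# [('split', 'this', 'stuff'), ('this', 'stuff', '.')]
```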