Obtain consecutive substrings in Python

Question

Given an n-gram, I want to get its consecutive substring patterns from 'start to end' and from 'end to start'.

For example, for the 4-gram computer supported machine translation I should get the following substrings.

from start to end: computer supported, computer supported machine
from end to start: machine translation, supported machine translation

For the 3-gram natural language processing, I should get natural language and language processing.

I have really large n-grams, so I am interested in knowing the quickest way of doing this!

The quickest or most efficient way will probably depend on how the input is stored before processing and how the output is to be stored after processing. — Galen, Dec 08 '17 at 04:20

score 0 · Accepted Answer · answered Dec 08 '17 at 04:17

You could split the n-gram into a list of grams and then join slices (see Understanding Python's slice notation):

ngram = "computer supported machine translation"
grams = ngram.split(" ")

# Start to end
for c in range(2, len(grams)):
    print(" ".join(grams[:c]))

# End to start
for c in range(2, len(grams)):
    print(" ".join(grams[-c:]))
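If you need the results as values rather than printed lines, the same slicing can be wrapped in a function. This is only a minimal sketch: the name prefixes_and_suffixes and the min_len parameter are hypothetical, and it uses split() with no argument, which (unlike split(" ")) collapses runs of whitespace.

```python
def prefixes_and_suffixes(ngram, min_len=2):
    """Return (prefixes, suffixes) of an n-gram.

    Each result has at least min_len words and fewer words than the
    full n-gram. min_len is assumed to be >= 1.
    """
    grams = ngram.split()
    lengths = range(min_len, len(grams))
    # Start to end: the first c words
    prefixes = [" ".join(grams[:c]) for c in lengths]
    # End to start: the last c words
    suffixes = [" ".join(grams[-c:]) for c in lengths]
    return prefixes, suffixes


prefixes, suffixes = prefixes_and_suffixes("computer supported machine translation")
print(prefixes)  # ['computer supported', 'computer supported machine']
print(suffixes)  # ['machine translation', 'supported machine translation']
```

For very long n-grams, note that every join rebuilds its string from scratch, so the total work grows roughly quadratically with the number of words.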

score 0 · Answer 2 · answered Dec 08 '17 at 05:16

You should use a function and then just pass the n-gram as a parameter.

Some of the code is borrowed from @Galen:

def count_grams(gram, sentence):
    grams = sentence.split(" ")

    words = []
    # Start to end: the first i words, each result wrapped in its own list
    for i in range(gram, len(grams)):
        words.append([" ".join(grams[:i])])
    # End to start: the last j words, each result wrapped in its own list
    for j in range(gram, len(grams)):
        words.append([" ".join(grams[-j:])])

    return words


print(count_grams(2, 'computer supported machine translation'))
print(count_grams(2, 'natural language processing'))

output:

[['computer supported'], ['computer supported machine'], ['machine translation'], ['supported machine translation']]
[['natural language'], ['language processing']]

If you don't want each result wrapped in its own list, append the " ".join(...) string to words directly.
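If you already have the nested result, one possible way to flatten it afterwards is itertools.chain.from_iterable. The sketch below starts from the first output shown above; the variable names nested and flat are just for illustration.

```python
from itertools import chain

# Nested result as returned for the 4-gram example
nested = [['computer supported'], ['computer supported machine'],
          ['machine translation'], ['supported machine translation']]

# Concatenate the one-element inner lists into a single flat list
flat = list(chain.from_iterable(nested))
print(flat)
# ['computer supported', 'computer supported machine', 'machine translation', 'supported machine translation']
```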
