
Given an n-gram, I want to get its consecutive word sub-sequences, both from start to end and from end to start.

For example, for the 4-gram "computer supported machine translation" I should get the following substrings:

  • from start to end: computer supported, computer supported machine
  • from end to start: machine translation, supported machine translation

For the 3-gram "natural language processing", I should get "natural language" and "language processing".

I have really large n-grams, so I am interested in knowing the quickest way of doing this!

  • The quickest or most efficient way will probably depend on how the input is stored before processing and how the output is to be stored after processing. – Galen Dec 08 '17 at 04:20

2 Answers


You could split the n-gram into a list of grams and then join slices (see Understanding Python's slice notation):

ngram = "computer supported machine translation"
grams = ngram.split(" ")

# Start to end
for c in range(2, len(grams)):
    print(" ".join(grams[:c]))

# End to start
for c in range(2, len(grams)):
    print(" ".join(grams[-c:]))
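
For very large n-grams, one possible variation on the slicing idea above is to record where the spaces are and slice the original string directly, which skips the repeated join calls. This is only a sketch: the function name prefixes_and_suffixes and the min_len parameter are made up for illustration, and it assumes the words are separated by exactly one space with no leading or trailing whitespace. Whether it is actually faster depends on the data, so it is worth timing on real input.

```python
def prefixes_and_suffixes(ngram, min_len=2):
    """Yield proper prefixes, then proper suffixes, of at least min_len words.

    Hypothetical helper: assumes single spaces between words and no
    leading or trailing whitespace.
    """
    # Character offsets of every space separating two words.
    spaces = [i for i, ch in enumerate(ngram) if ch == " "]
    n_words = len(spaces) + 1

    # Start to end: a prefix of k words ends just before the k-th space.
    for k in range(min_len, n_words):
        yield ngram[:spaces[k - 1]]

    # End to start: a suffix of k words starts just after the k-th space
    # counted from the end of the string.
    for k in range(min_len, n_words):
        yield ngram[spaces[-k] + 1:]


for sub in prefixes_and_suffixes("computer supported machine translation"):
    print(sub)
```

Because it is a generator, the substrings are produced one at a time rather than all held in memory at once.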
Galen

You could wrap this in a function and pass the n-gram as a parameter.

Part of the code is borrowed from @Galen:

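def count_grams(gram, sentence):
    # Split the n-gram into its individual words.
    grams = sentence.split(" ")

    words = []
    # Start to end: prefixes of at least `gram` words, excluding the full n-gram.
    for i in range(gram, len(grams)):
        start = []
        start.append(" ".join(grams[:i]))
        words.append(start)
    # End to start: suffixes of at least `gram` words, excluding the full n-gram.
    for j in range(gram, len(grams)):
        end = []
        end.append(" ".join(grams[-j:]))
        words.append(end)

    return words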



print(count_grams(2,'computer supported machine translation'))
print(count_grams(2,'natural language processing'))

output:

[['computer supported'], ['computer supported machine'], ['machine translation'], ['supported machine translation']]
[['natural language'], ['language processing']]

If you don't want each result wrapped in its own list, append the " ".join(...) string to words directly.
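
A flat variant might look like the sketch below. The name sub_ngrams and the min_len parameter are hypothetical, not part of the answer above; the function returns the two directions as separate lists of plain strings.

```python
def sub_ngrams(sentence, min_len=2):
    """Return (start_to_end, end_to_start) lists of sub-n-gram strings.

    Hypothetical helper: the full n-gram itself is excluded, matching the
    examples in the question.
    """
    words = sentence.split()
    # Lengths from min_len up to one word short of the full n-gram.
    sizes = range(min_len, len(words))
    start_to_end = [" ".join(words[:k]) for k in sizes]
    end_to_start = [" ".join(words[-k:]) for k in sizes]
    return start_to_end, end_to_start


starts, ends = sub_ngrams("computer supported machine translation")
print(starts)
print(ends)
```

Keeping the two directions apart makes it easy to use only one of them; concatenating the lists gives the same order as count_grams.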

Aaditya Ura