Break string into words and phrases

Question

Supposing I have a string with several space-separated words, like

words = "foo bar baz qux"

If I want a list of the words, I can just call words.split() and get

['foo','bar','baz','qux']

But if I want to get each word and each set of (adjacent) words, like

['foo bar baz qux', 'foo bar baz', 'bar baz qux', 
'foo bar', 'bar baz', 'baz qux', 'foo', 'bar',
'baz', 'qux']

How can I go about this? I'm sure I can write a big ugly function that takes a string like words and iterates over each set of adjacent elements to return the above, but I've a hunch there's a more elegant way to go about it. Is there?

score 1 · Accepted Answer · edited May 23 '17 at 11:43

Pretty "ugly" and with `itertools`:

Combining "Find all consecutive sub-sequences of length n in a sequence" and "concatenating sublists python":

from itertools import chain

words = "foo bar baz qux"

w = words.split()
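# n is the phrase length; zipping w[0:], w[1:], ..., w[n-1:] yields every run of n adjacent words
# list() and print() keep this working on both Python 2 and Python 3
print(list(map(' '.join, chain.from_iterable(zip(*(w[i:] for i in range(n))) for n in range(1, len(w) + 1)))))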

Output:

['foo', 'bar', 'baz', 'qux', 'foo bar', 'bar baz', 'baz qux', 'foo bar baz', 'bar baz qux', 'foo bar baz qux']
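
The one-liner is dense, so here is a rough sketch of the same idea split into named steps. The helper names `windows` and `phrases_by_length` are made up for illustration and do not come from either linked question.

```python
from itertools import chain


def windows(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip(tokens[0:], tokens[1:], ..., tokens[n-1:]) stops at the shortest
    # slice, so it yields exactly len(tokens) - n + 1 windows.
    return zip(*(tokens[k:] for k in range(n)))


def phrases_by_length(text):
    """Join every window of every length, shortest phrases first."""
    tokens = text.split()
    all_windows = chain.from_iterable(
        windows(tokens, n) for n in range(1, len(tokens) + 1)
    )
    return [' '.join(window) for window in all_windows]


print(phrases_by_length("foo bar baz qux"))
# ['foo', 'bar', 'baz', 'qux', 'foo bar', 'bar baz', 'baz qux',
#  'foo bar baz', 'bar baz qux', 'foo bar baz qux']
```

Splitting out `windows` makes it easier to see that the outer loop picks the phrase length and the `zip` slides across the word list.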

Not so ugly and pure Python:

I found a pretty short solution - although it has two nested for-loops.

# w = words.split(), as above; i is the first word of the phrase, j the last
print([' '.join(w[i:j+1]) for i in range(len(w)) for j in range(i, len(w))])

Output:

['foo', 'foo bar', 'foo bar baz', 'foo bar baz qux', 'bar', 'bar baz', 'bar baz qux', 'baz', 'baz qux', 'qux']
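
Neither output above matches the order shown in the question, which lists the longest phrases first. If that order matters, one possible variation is to loop over the phrase length in descending order and slide a start index across the words. The function name `phrases_longest_first` is an assumption for this sketch.

```python
def phrases_longest_first(text):
    """Return all runs of adjacent words, longest first, left to right."""
    tokens = text.split()
    count = len(tokens)
    return [
        ' '.join(tokens[start:start + size])
        for size in range(count, 0, -1)        # longest phrases first
        for start in range(count - size + 1)   # slide left to right
    ]


print(phrases_longest_first("foo bar baz qux"))
# ['foo bar baz qux', 'foo bar baz', 'bar baz qux', 'foo bar', 'bar baz',
#  'baz qux', 'foo', 'bar', 'baz', 'qux']
```

It is still two nested loops, so a sentence of n words produces n * (n + 1) / 2 phrases.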

I actually like the pure Python route best, which I did not expect. And two for loops isn't going to be very problematic for my use case. — jgysland, Mar 08 '15 at 22:46

score 0 · Answer 2 · answered Mar 08 '15 at 19:21

You could use the nltk library, which is built for natural language processing, e.g.:

from nltk.util import ngrams

sentence = 'foo bar baz qux'

adj = [3, 2, 1]  # n-gram lengths to extract
for n in adj:
    # each n-gram is a tuple of words, e.g. ('foo', 'bar', 'baz')
    print(list(ngrams(sentence.split(), n)))

wrdeman

I've been looking for a reason to dig into nltk, but this (and a couple variations I tried) doesn't produce the desired result. :-( – jgysland Mar 08 '15 at 22:44
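
A likely reason for the mismatch is that `ngrams` yields tuples of words rather than joined strings, and the fixed lengths `[3, 2, 1]` leave out the full four-word phrase. A hedged sketch of one way to adapt it follows. The function name `phrases_from_ngrams` is made up, and the fallback `ngrams` is a hypothetical stand-in that only mimics the tuple output so the example still runs without nltk installed.

```python
try:
    from nltk.util import ngrams  # real nltk helper, if nltk is installed
except ImportError:
    def ngrams(sequence, n):
        # Hypothetical stand-in: yields tuples of n consecutive items.
        return zip(*(sequence[k:] for k in range(n)))


def phrases_from_ngrams(text):
    """Collect joined n-grams for every length, longest first."""
    tokens = text.split()
    result = []
    for n in range(len(tokens), 0, -1):
        # Join each tuple of words back into a space-separated phrase.
        result.extend(' '.join(gram) for gram in ngrams(tokens, n))
    return result


print(phrases_from_ngrams('foo bar baz qux'))
# ['foo bar baz qux', 'foo bar baz', 'bar baz qux', 'foo bar', 'bar baz',
#  'baz qux', 'foo', 'bar', 'baz', 'qux']
```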

score 0 · Answer 3 · answered Mar 08 '15 at 19:29

The first principles approach (i.e., without needing to import anything) is indeed "ugly", but not too "big", really...

words = ['foo', 'bar', 'baz', 'qux']  # renamed so the built-in list is not shadowed
length = len(words)
newlist = []
# enumerate gives each word's position directly; index() would pick the
# wrong start when a word occurs more than once
for start, item in enumerate(words):
    phrase = item
    newlist.append(phrase)
    # unless this is the last word, extend the phrase one word at a time
    for i in range(start + 1, length):
        phrase = phrase + ' ' + words[i]
        newlist.append(phrase)

print(newlist)

Result

['foo', 'foo bar', 'foo bar baz', 'foo bar baz qux', 'bar', 'bar baz', 'bar baz qux', 'baz', 'baz qux', 'qux']
