
I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.

I want to take this: "the quick brown fox"

And turn it into this: "the quick", "quick brown", "brown fox"

Everything I've tried brings me back everything from single words up to the full four-word combination, but nothing returns just the pairs.

I've tried a bunch of different uses of itertools.combinations, and I know it's doable, but I simply cannot figure out the right call. I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?

Rob M

3 Answers


Try:

s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)

Output

['the quick', 'quick brown', 'brown fox']

Explanation

Creating an iterator of word pairs using zip:

zip(words, words[1:])

Iterating over the pairs:

for pair in zip(words, words[1:])

Joining each pair into a two-word phrase:

[' '.join(pair) for ...]
DarrylG

@DarrylG's answer seems the way to go, but you can also use:

s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n < len(p) - 1]
# ['the quick', 'quick brown', 'brown fox']


Pedro Lobito

If you want a pure iterator solution for large strings with constant memory usage:

import itertools
import re

input       = "the quick brown fox"
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2)  # skip the first word so the two streams are offset by one
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']

If you can afford roughly 3x the string's memory to store both the split() list and the doubled output as lists, it might be quicker and easier not to use itertools:

inputs = "the quick brown fox".split(' ')    

output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ] 
#  ['the quick', 'quick brown', 'brown fox']

Update

Generalized solution to support arbitrary ngram sizes:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Advance the n-th iterator by n words so the zipped streams are offset
    for n in range(1, ngram_size):
        for it in input_iters[n:]:
            next(it)

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

You may also find this question of relevance: n-grams in python, four, five, six grams?
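For completeness, here is a single-pass alternative sketched with collections.deque, similar in spirit to the sliding_window recipe in the itertools docs; word_ngrams is just an illustrative name:

```python
from collections import deque
from itertools import islice

def word_ngrams(text: str, n: int):
    """Yield space-joined n-grams of the words in text in one pass, O(n) memory."""
    words = iter(text.split())
    window = deque(islice(words, n - 1), maxlen=n)  # prime with the first n-1 words
    for w in words:
        window.append(w)          # maxlen drops the oldest word automatically
        yield ' '.join(window)

list(word_ngrams("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
```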

James McGuigan