
I have spent unbelievable hours trying to hunt down a way to use itertools to transform a sentence into a list of two-word phrases.

I want to take this: "the quick brown fox"

And turn it into this: "the quick", "quick brown", "brown fox"

Everything I've tried brings me back everything from single words up to the full four-word combination, but nothing returns just the pairs.

I've tried a bunch of different uses of itertools.combinations, and I know it's doable, but I simply cannot figure out the right call. I don't want to define a function for something I know is doable in two lines of code or less. Can anyone please help me?

Rob M

3 Answers


Try:

s = "the quick brown fox"
words = s.split()
result = [' '.join(pair) for pair in zip(words, words[1:])]
print(result)

Output

['the quick', 'quick brown', 'brown fox']

Explanation

Creating an iterator of word pairs using zip:

zip(words, words[1:])

Iterating over the pairs:

for pair in zip(words, words[1:])

Joining each pair into a two-word phrase:

[' '.join(pair) for ...]
DarrylG

@DarrylG's answer seems the way to go, but you can also use:

s = "the quick brown fox"
p = s.split()
ns = [f"{w} {p[n+1]}" for n, w in enumerate(p) if n < len(p) - 1]
# ['the quick', 'quick brown', 'brown fox']


Pedro Lobito

If you want a pure iterator solution for large strings with constant memory usage:

import itertools
import re

input       = "the quick brown fox"
input_iter1 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
input_iter2 = map(lambda m: m.group(0), re.finditer(r"[^\s]+", input))
next(input_iter2)  # skip the first word so the two streams are offset by one
output = itertools.starmap(
    lambda a, b: f"{a} {b}",
    zip(input_iter1, input_iter2)
)
list(output)
# ['the quick', 'quick brown', 'brown fox']

If you can afford roughly 3x the string's memory to store both the split() list and the doubled output as lists, it might be quicker and easier not to use itertools:

inputs = "the quick brown fox".split(' ')    

output = [ f"{inputs[i]} {inputs[i+1]}" for i in range(len(inputs)-1) ] 
#  ['the quick', 'quick brown', 'brown fox']

Update

Generalized solution to support arbitrary ngram sizes:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Advance the n-th iterator by n words so the zipped streams are offset
    for n in range(1, ngram_size):
        for it in input_iters[n:]:
            next(it)

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

You may also find this question of relevance: n-grams in python, four, five, six grams?
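For completeness, here is a single-pass alternative sketched with collections.deque, similar in spirit to the sliding_window recipe in the itertools docs; word_ngrams is just an illustrative name:

```python
from collections import deque
from itertools import islice

def word_ngrams(text: str, n: int):
    """Yield space-joined n-grams of the words in text in one pass, O(n) memory."""
    words = iter(text.split())
    window = deque(islice(words, n - 1), maxlen=n)  # prime with the first n-1 words
    for w in words:
        window.append(w)          # maxlen drops the oldest word automatically
        yield ' '.join(window)

list(word_ngrams("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
```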

James McGuigan