-1

I want to split a sentence into bigrams. For example:

"My name is really nice. This is so awesome." 

-->

["My name","name is", "is really", "really nice.", "This is", "is so", "so awesome."]

Any help?

Avinash Raj
Abhirup Ghosh

3 Answers

1

You could do this with a positive lookahead:

>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']

Pattern Explanation:

  • (?=...) A lookahead is a zero-length assertion, like the start and end of a line or a word boundary. It does not consume characters in the string; it only asserts whether a match is possible at the current position.
  • () A capturing group, which captures the characters matched by the pattern inside the parentheses.
  • \b Word boundary. It matches between a word character and a non-word character.
  • \w+ Matches one or more word characters.
  • \S+ Matches one or more non-space characters. The literal space before it in the pattern matches the single space between the two words.
  • findall returns the contents of the capturing groups when the pattern has any; if there are no capturing groups, it returns the whole matches. Here it returns the characters captured by group 1. To match overlapping text, you need to put the pattern inside a lookahead.
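To see why the lookahead matters, here is a small sketch that runs the same pattern with and without it on the question's sentence. The variable names `consuming` and `overlapping` are only illustrative.

```python
import re

s = "My name is really nice. This is so awesome."

# Without the lookahead, each match consumes both words, so the second
# word of one match can never start the next match.
consuming = re.findall(r'\b\w+\b \S+', s)

# Inside a lookahead nothing is consumed, so the scan resumes at the
# very next word; findall returns what group 1 captured.
overlapping = re.findall(r'(?=(\b\w+\b \S+))', s)

print(consuming)    # ['My name', 'is really', 'This is', 'so awesome.']
print(overlapping)  # the seven overlapping bigrams shown above
```

Note that "nice. This" is never produced, because the pattern requires a space directly after the first word and "nice" is followed by a period.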
Avinash Raj
1
def ngrams(words, n):
    """Return all slices of n consecutive items from words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

Output:

In [67]: ngrams("My name is really nice".split(),2)
Out[67]: [['My', 'name'], ['name', 'is'], ['is', 'really'], ['really', 'nice']]
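The function above returns lists of words rather than the space-joined strings the question asks for. A minimal sketch of one way to bridge that gap follows; `ngram_strings` is a hypothetical helper name, not part of the original answer.

```python
def ngrams(words, n):
    # Same sliding-window slice as in the answer above.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def ngram_strings(text, n):
    # Hypothetical helper: split on whitespace, then join each n-gram
    # back into a single space-separated string.
    return [' '.join(gram) for gram in ngrams(text.split(), n)]

bigrams = ngram_strings("My name is really nice", 2)
print(bigrams)  # ['My name', 'name is', 'is really', 'really nice']
```

This treats the whole input as one sentence, so a multi-sentence string would also yield a bigram that spans the sentence boundary.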
inspectorG4dget
0

First split the string into sentences with split('.'), then split each sentence into words and use zip() to pair each word with the one after it. Note that splitting on '.' drops the periods from the output.

>>> s = "My name is really nice. This is so awesome."
>>> [' '.join(i) for s2 in s.split('.') for i in zip(s2.split(), s2.split()[1:])]
['My name', 'name is', 'is really', 'really nice', 'This is', 'is so', 'so awesome']
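If the trailing punctuation should stay attached, as in the question's expected output, one possible variation is to split sentences with a lookbehind instead of split('.'). This is a sketch under the assumption that sentences end in '.', '!' or '?' followed by whitespace; `sentence_bigrams` is a hypothetical name.

```python
import re

def sentence_bigrams(text):
    # Split on whitespace that follows '.', '!' or '?'. The lookbehind
    # does not consume the punctuation, so it stays on the last word.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    result = []
    for sentence in sentences:
        words = sentence.split()
        # Pair each word with its successor within the same sentence.
        result.extend(' '.join(pair) for pair in zip(words, words[1:]))
    return result

bigrams = sentence_bigrams("My name is really nice. This is so awesome.")
print(bigrams)
```

Abbreviations such as "e.g." would be treated as sentence ends by this simple rule, so it is only suitable for plain text like the example.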
Mazdak