Bigram in python

Question

I wanted to divide a sentence into bi-grams. For example:

"My name is really nice. This is so awesome."

-->

["My name","name is", "is really", "really nice.", "This is", "is so", "so awesome."]

Any help?

This is in no way related to the "Rolling or Sliding Window iterator in Python", inspectorG4dget. — Abhirup Ghosh, Sep 21 '14 at 14:05

Avinash Raj · Accepted Answer · 2014-09-21T14:16:32.033

You could do this with a positive lookahead:

>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']

Pattern Explanation:

(?=...) A lookahead is a zero-length assertion, like the start and end of a line or a word boundary. It does not consume characters in the string; it only asserts whether a match is possible at that position.
() Capturing group, which captures the characters matched by the pattern inside the parentheses.
\b Word boundary. It matches between a word character and a non-word character (or at the start or end of the string next to a word character).
\w+ Matches one or more word characters.
\S+ Matches one or more non-space characters. The literal space before it in the pattern matches the single space between the two words.
findall returns the contents of the capturing groups when the pattern has any; if there are no capturing groups, it returns the whole matches. In our case it returns the characters captured by group 1. To match overlapping text, you need to put the pattern inside a lookahead.
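To see why the lookahead matters, here is a small sketch comparing the same pattern with and without it. The variable names `consumed` and `overlapping` are only illustrative.

```python
import re

s = "My name is really nice. This is so awesome."

# Without the lookahead, each match consumes both words,
# so the next search starts after the pair and pairs cannot overlap.
consumed = re.findall(r'\b\w+\b \S+', s)

# Inside a lookahead nothing is consumed, so the scan moves on one
# position at a time and group 1 yields every overlapping pair.
overlapping = re.findall(r'(?=(\b\w+\b \S+))', s)

print(consumed)     # ['My name', 'is really', 'This is', 'so awesome.']
print(overlapping)  # ['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']
```

Note that "nice. This" is never produced: after "nice" the pattern requires a space, but the next character is a period.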

It would be really nice if you could explain your answer sir — Abhirup Ghosh, Sep 21 '14 at 14:07

score 1 · Answer 2 · answered Sep 21 '14 at 13:55

def ngrams(words, n):
    return [words[i:i+n] for i in range(len(words)-n+1)]

Output:

In [67]: ngrams("My name is really nice".split(),2)
Out[67]: [['My', 'name'], ['name', 'is'], ['is', 'really'], ['really', 'nice']]
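If strings rather than lists of words are wanted, as in the question, the result can be joined afterwards. This is only a sketch built on the `ngrams` function above; the names `sentence` and `bigrams` are made up for the example. Be aware that splitting on whitespace alone does not respect sentence boundaries, so it also yields a pair that spans the two sentences.

```python
def ngrams(words, n):
    # Slide a window of length n over the list of words.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

sentence = "My name is really nice. This is so awesome."

# Join each window back into a single space-separated string.
bigrams = [" ".join(gram) for gram in ngrams(sentence.split(), 2)]

print(bigrams)
# ['My name', 'name is', 'is really', 'really nice.', 'nice. This',
#  'This is', 'is so', 'so awesome.']
```

The extra 'nice. This' entry is the pair crossing the sentence boundary, which the output requested in the question does not contain.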

inspectorG4dget

Mazdak · Answer 3 · 2014-09-21T14:03:21.023

0

First split the string into sentences with split('.'), then split each sentence into words, and use zip() to pair each word with the one that follows it:

>>> s = "My name is really nice. This is so awesome."
>>> [' '.join(i) for s2 in s.split('.') for i in zip(s2.split(), s2.split()[1:])]
['My name', 'name is', 'is really', 'really nice', 'This is', 'is so', 'so awesome']
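The same steps can be spelled out as a function, which may be easier to follow than the one-liner. This is a sketch only; the function name `sentence_bigrams` is hypothetical. Like the one-liner, it drops the periods, because they are used as the sentence separator.

```python
def sentence_bigrams(text):
    """Return bigrams that never cross a '.' sentence boundary."""
    result = []
    for sentence in text.split('.'):
        words = sentence.split()
        # zip pairs each word with its successor and stops at the shorter
        # list, so empty or one-word sentences contribute nothing.
        for first, second in zip(words, words[1:]):
            result.append(first + ' ' + second)
    return result

s = "My name is really nice. This is so awesome."
pairs = sentence_bigrams(s)
print(pairs)
# ['My name', 'name is', 'is really', 'really nice', 'This is', 'is so', 'so awesome']
```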

edited Sep 21 '14 at 14:03

answered Sep 21 '14 at 13:54

Yes, I edited the answer to split the string on '.' first. – Mazdak Sep 21 '14 at 14:04
