33

I have a list of sentences:

text = ['cant railway station', 'citadel hotel', ' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:

import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here 'cant railway station' and 'citadel hotel' form one bigram, which is wrong. What I want is

[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), and so on...]

The last word of the first sentence should not merge with the first word of the second sentence. What should I do to make this work?

Martin Thoma
Hypothetical Ninja

10 Answers

56

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
butch
  • and if you want to keep each sentence's bigrams in its own list: `[[b for b in zip(l.split(" ")[:-(n-1)], l.split(" ")[(n-1):])] for l in x]` – Joe Jun 27 '20 at 20:49
  • This is a wonderful approach for the general case and solves the OP's question straightforwardly, but it is also worth mentioning that it is sometimes useful to treat punctuation marks as separate words, e.g. if the intent is to train an n-gram language model in order to calculate the grammaticality of a sentence, so `.split(" ")` may not be ideal here. It may be best to use nltk.word_tokenize along with nltk.sent_tokenize instead (see the sketch below). – Ender Feb 28 '21 at 06:33
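A minimal sketch of the tokenizer-based variant suggested in that comment, assuming the input arrives as one raw string rather than a pre-split list (the punkt tokenizer models must be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # run once to fetch the tokenizer models
raw = "cant railway station. citadel hotel. police stn."
bigrams = [b
           for sent in sent_tokenize(raw)               # split into sentences first
           for b in nltk.bigrams(word_tokenize(sent))]  # pair words only within a sentence
print(bigrams)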
17
from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    tokens = word_tokenize(line)
    bigrams = list(ngrams(tokens, 2))  # the 2 means bigrams; change it to get n-grams of a different size
    print(bigrams)
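If you want all the pairs collected into one flat list, as in the question's expected output, a small extension of the loop might look like this:

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
all_bigrams = []
for line in text:
    all_bigrams.extend(ngrams(word_tokenize(line), 2))  # pairs never cross sentence boundaries
print(all_bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]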
ISE
gurinder
9

Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove these portions if they're irrelevant to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    # keep the 500 bigrams that score highest on the chi-squared measure
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # append each selected bigram to the token list as a single "word1 word2" string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # stem and lowercase everything, dropping stopwords and short tokens
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

Use it like so:

for line in text:  # 'text' being your list of sentences
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).
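If you only need the scored pairs rather than the stemmed feature strings, the collocation finder can also hand them back together with their scores via score_ngrams; a minimal sketch using the same classes (imports as above):

tokens = WordPunctTokenizer().tokenize('cant railway station citadel hotel police stn')
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(BigramAssocMeasures.chi_sq)  # [((word1, word2), score), ...], best first
print(scored[:3])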

Dan
5

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
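If you prefer plain tuples, matching the other answers' output shape, only the append line changes:

ans = []
text = ['cant railway station', 'citadel hotel', ' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr) - 1):
        ans.append((arr[i], arr[i + 1]))  # tuple instead of nested single-item lists

print(ans)  # [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]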
Nir Alfasi
3
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i, ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

Using enumerate and the split function.
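Since tex.split() is recomputed for every token in that comprehension, a variant that splits each sentence only once might look like this:

>>> bigrams = [pair
...            for tex in text
...            for words in [tex.split()]          # split once per sentence
...            for pair in zip(words, words[1:])]  # zip stops at the last full pair
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]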

Tanveer Alam
2

Read the dataset

import pandas as pd
import nltk as nk

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

month1 = pd.DataFrame(data = count_bigrams[0], index= bigrams[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams[1], index= bigrams[1], columns= ["Count"])
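Counting with x.count(item) for every item is quadratic in the number of bigrams; as a possible alternative, collections.Counter produces the same tallies in a single pass per month (a sketch built on the same bigrams series):

from collections import Counter

counts = bigrams.apply(Counter)  # one Counter of bigram frequencies per month
month1 = pd.DataFrame.from_dict(counts.iloc[0], orient='index', columns=['Count'])
month2 = pd.DataFrame.from_dict(counts.iloc[1], orient='index', columns=['Count'])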
Syscall
avi
1

Just fixing Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Jay Marm
1

The best way is to use the zip function to generate the n-grams; the 2 in the range call is the n-gram size:

test = [1,2,3,4,5,6,7,8,9]
print(test[0:])
print(test[1:])
print(list(zip(test[0:],test[1:])))
%timeit list(zip(*[test[i:] for i in range(2)]))

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9]  
[2, 3, 4, 5, 6, 7, 8, 9]  
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]  
1000000 loops, best of 3: 1.34 µs per loop  
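Wrapped as a small helper, the same zip trick handles any n; a sketch:

def ngrams(seq, n=2):
    # zip n shifted copies of the sequence; zip stops at the shortest, so no padding is needed
    return list(zip(*[seq[i:] for i in range(n)]))

print(ngrams([1, 2, 3, 4, 5], 3))  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]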
0

There are a number of ways to solve it, but I solved it this way:

>>> from nltk import bigrams
>>> text = ['cant railway station', 'citadel hotel', ' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> # you could use a list comprehension here as well
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
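The list comprehension hinted at above could look like this:

>>> output = [pair for sent in text2 for pair in bigrams(sent)]
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]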
saicharan
0

I think the best and most general way to do it is the following, where L is a list of token sequences (e.g. L = [line.split() for line in text]):

n      = 2
ngrams = []

for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])

or in other words:

ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]

This should work for any n and any sequence l; if there are no n-grams of length n, the result is an empty list.
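Applied to the question's data, for example:

text = ['cant railway station', 'citadel hotel', ' police stn']
L = [line.split() for line in text]

n = 2
ngrams = [l[i-n:i] for l in L for i in range(n, len(l)+1)]
print(ngrams)
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]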

Radio Controlled