33

I have a list of sentences:

text = ['cant railway station', 'citadel hotel', ' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I did:

import nltk

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

which yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here 'cant railway station' and 'citadel hotel' form one bigram, which is wrong. What I want is

[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), and so on...]

The last word of the first sentence should not merge with the first word of the second sentence. What should I do to make this work?

Martin Thoma
Hypothetical Ninja

10 Answers

56

Using list comprehensions and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
butch
  • and if you want to keep each sentence's bigrams in its own list: `[[b for b in zip(l.split(" ")[:-(n-1)], l.split(" ")[(n-1):])] for l in x]` – Joe Jun 27 '20 at 20:49
  • This is a wonderful approach for the general case and solves the OP's question straightforwardly, but it is also worth mentioning that it is sometimes useful to treat punctuation marks as separate words, e.g. if the intent is to train an n-gram language model in order to calculate the grammaticality of a sentence, so `.split(" ")` may not be ideal here. It may be best to use nltk.word_tokenize along with nltk.sent_tokenize instead (see the sketch below). – Ender Feb 28 '21 at 06:33
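A minimal sketch of the tokenizer-based variant suggested in that comment, assuming the input arrives as one raw string rather than a pre-split list (the punkt tokenizer models must be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # run once to fetch the tokenizer models
raw = "cant railway station. citadel hotel. police stn."
bigrams = [b
           for sent in sent_tokenize(raw)               # split into sentences first
           for b in nltk.bigrams(word_tokenize(sent))]  # pair words only within a sentence
print(bigrams)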
17
from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    tokens = word_tokenize(line)
    bigrams = list(ngrams(tokens, 2))  # the 2 means bigrams; change it to get n-grams of a different size
    print(bigrams)
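If you want all the pairs collected into one flat list, as in the question's expected output, a small extension of the loop might look like this:

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
all_bigrams = []
for line in text:
    all_bigrams.extend(ngrams(word_tokenize(line), 2))  # pairs never cross sentence boundaries
print(all_bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]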
ISE
gurinder
9

Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove these portions if they're irrelevant to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    # keep the 500 bigrams that score highest on the chi-squared measure
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    # append each selected bigram to the token list as a single "word1 word2" string
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    # stem and lowercase everything, dropping stopwords and short tokens
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()])
              for x in tokens
              if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

Use it like so:

for line in text:  # 'text' being your list of sentences
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually statistically scores the bigrams (which will come in handy in training the model).
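If you only need the scored pairs rather than the stemmed feature strings, the collocation finder can also hand them back together with their scores via score_ngrams; a minimal sketch using the same classes (imports as above):

tokens = WordPunctTokenizer().tokenize('cant railway station citadel hotel police stn')
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(BigramAssocMeasures.chi_sq)  # [((word1, word2), score), ...], best first
print(scored[:3])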

Dan
5

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
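If you prefer plain tuples, matching the other answers' output shape, only the append line changes:

ans = []
text = ['cant railway station', 'citadel hotel', ' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr) - 1):
        ans.append((arr[i], arr[i + 1]))  # tuple instead of nested single-item lists

print(ans)  # [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]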
Nir Alfasi
3
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i, ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

Using enumerate and the split function.
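Since tex.split() is recomputed for every token in that comprehension, a variant that splits each sentence only once might look like this:

>>> bigrams = [pair
...            for tex in text
...            for words in [tex.split()]          # split once per sentence
...            for pair in zip(words, words[1:])]  # zip stops at the last full pair
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]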

Tanveer Alam
2

Read the dataset

import pandas as pd
import nltk as nk

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

month1 = pd.DataFrame(data = count_bigrams[0], index= bigrams[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams[1], index= bigrams[1], columns= ["Count"])
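Counting with x.count(item) for every item is quadratic in the number of bigrams; as a possible alternative, collections.Counter produces the same tallies in a single pass per month (a sketch built on the same bigrams series):

from collections import Counter

counts = bigrams.apply(Counter)  # one Counter of bigram frequencies per month
month1 = pd.DataFrame.from_dict(counts.iloc[0], orient='index', columns=['Count'])
month2 = pd.DataFrame.from_dict(counts.iloc[1], orient='index', columns=['Count'])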
Syscall
avi
1

Just fixing Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Jay Marm
1

The best way is to use the zip function to generate the n-grams; the 2 in the range call is the n-gram size:

test = [1,2,3,4,5,6,7,8,9]
print(test[0:])
print(test[1:])
print(list(zip(test[0:],test[1:])))
%timeit list(zip(*[test[i:] for i in range(2)]))

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9]  
[2, 3, 4, 5, 6, 7, 8, 9]  
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]  
1000000 loops, best of 3: 1.34 µs per loop  
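Wrapped as a small helper, the same zip trick handles any n; a sketch:

def ngrams(seq, n=2):
    # zip n shifted copies of the sequence; zip stops at the shortest, so no padding is needed
    return list(zip(*[seq[i:] for i in range(n)]))

print(ngrams([1, 2, 3, 4, 5], 3))  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]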
0

There are a number of ways to solve it, but I solved it this way:

>>> from nltk import bigrams
>>> text = ['cant railway station', 'citadel hotel', ' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> # you could use a list comprehension here as well
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
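The list comprehension hinted at above could look like this:

>>> output = [pair for sent in text2 for pair in bigrams(sent)]
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]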
saicharan
0

I think the best and most general way to do it is the following, where L is a list of token sequences (e.g. L = [line.split() for line in text]):

n      = 2
ngrams = []

for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])

or in other words:

ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]

This should work for any n and any sequence l; if there are no n-grams of length n, the result is an empty list.
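Applied to the question's data, for example:

text = ['cant railway station', 'citadel hotel', ' police stn']
L = [line.split() for line in text]

n = 2
ngrams = [l[i-n:i] for l in L for i in range(n, len(l)+1)]
print(ngrams)
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]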

Radio Controlled