
I am trying to run the code for N-Gram Language Modelling with NLTK from https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/, but it throws an error.

# generate frequency of n-grams
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

d = defaultdict(Counter)
for a, b, c in freq_tri:
    if(a != None and b!= None and c!= None):
        d[a, b] += freq_tri[a, b, c]

The error I got is shown below:

AttributeError                            Traceback (most recent call last)
<ipython-input-12-ae7c0728f2d6> in <module>
      3     print(freq_tri[a,b,c])
      4     if(a != None and b!= None and c!= None):
----> 5       d[a, b] += freq_tri[a, b, c]
AttributeError: 'int' object has no attribute 'items'

The entire code is available at the linked site.

SUGANTHI M

1 Answer


The code on GeeksforGeeks is a bit outdated and lacks a full working example =(
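First, the direct cause of the error in the question: d is a defaultdict(Counter), so d[a, b] is a Counter, and a Counter's += expects another mapping (its in-place add iterates other.items()), not the plain int that freq_tri[a, b, c] returns. Depending on what you want to count, a small fix looks like this (a sketch reusing freq_tri from the question):

from collections import Counter, defaultdict

# Option 1: count each possible next word c per bigram prefix (a, b),
# which is presumably what the tutorial is after for next-word prediction.
d = defaultdict(Counter)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[a, b][c] += freq_tri[a, b, c]

# Option 2: only total the trigram counts per bigram prefix.
totals = defaultdict(int)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        totals[a, b] += freq_tri[a, b, c]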

Beyond that one-line fix, let's walk through the code step by step instead of giving a copy-and-paste answer!

Download the data/model dependencies

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')

Import the modules necessary to preprocess the data from Reuters

import string
import random
from nltk.corpus import stopwords

# write the removal characters such as : Stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '“' + '”' + '-' + '‘' + '’' + '—'
removal_list = list(stop_words) + list(string.punctuation)+ ['lt','rt']

Read the Reuters corpus and collect the n-grams

The GeeksforGeeks code hardcodes the n-grams, but there is a handy everygrams function (see https://stackoverflow.com/a/54177775/610569):

from itertools import chain
from nltk.corpus import reuters
from nltk import FreqDist, ngrams, everygrams

sents = reuters.sents()[:30]

# everygrams works on a flat sequence of tokens, so collect 1- to 3-grams per sentence.
one_two_three_grams = chain(*[everygrams(sent, 1, 3, pad_left=True, pad_right=True) for sent in sents])

# Drop n-grams that contain padding (None) or empty tokens.
one_two_three_grams = [ng for ng in one_two_three_grams if all(word for word in ng if word not in removal_list)]
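For a quick sanity check, peek at the most frequent of the collected n-grams:

# Most common n-grams in this small sample.
print(FreqDist(one_two_three_grams).most_common(5))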

Picking words from the salad

from itertools import chain
from nltk.corpus import reuters
from nltk import FreqDist, ngrams, everygrams

sents = reuters.sents()

# Collect 1- to 4-grams from every sentence in the corpus.
one_to_four_ngrams = chain(*[everygrams(sent, 1, 4, pad_left=True, pad_right=True) for sent in sents])

# Drop n-grams that contain padding (None) or empty tokens.
one_to_four_ngrams = [ng for ng in one_to_four_ngrams if all(word for word in ng if word not in removal_list)]


# Keep a counter of ngrams.
word_salad = FreqDist(one_to_four_ngrams)

# Given an input "prompt" / prefix
prefix = 'it will'

# Check which n-grams could come next:
print([ng for ng in word_salad if ' '.join(ng).lower().startswith(prefix.lower())])

[out]:

[('it', 'will'), ('It', 'will'), ('it', 'will', 'impose'), ('it', 'will', 'impose', '300'), ('it', 'will', 'mean'), ('it', 'will', 'mean', 'the'), ('it', 'will', 'be'), ('it', 'will', 'be', 'extended'), ('it', 'will', 'have'), ('it', 'will', 'have', 'on'), ('it', 'will', 'establish'), ('it', 'will', 'establish', 'a'), ('it', 'will', 'vastly'), ('it', 'will', 'vastly', 'expand'), ('It', 'will', 'be'), ('It', 'will', 'be', 'the'), ('it', 'will', 'also'), ('it', 'will', 'also', 'open'), ('It', 'will', 'only'), ('It', 'will', 'only', 'disappear'), ('It', 'will', 'remain'), ('It', 'will', 'remain', 'very'), ('it', 'will', 'not'), ('it', 'will', 'not', 'allow'), ('it', 'will', 'withdraw'), ('it', 'will', 'withdraw', 'the'), ('it', 'will', 'concentrate'), ('it', 'will', 'concentrate', 'on')]
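That only lists the matches; to rank them, sort by their counts in word_salad, e.g.:

# Rank the matching n-grams by how often they occur in the corpus.
matches = [ng for ng in word_salad if ' '.join(ng).lower().startswith(prefix.lower())]
for ng in sorted(matches, key=word_salad.get, reverse=True)[:5]:
    print(word_salad[ng], ng)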

What about some probabilities?

See https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk
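If you just want a quick taste without the notebook, here is a minimal sketch using nltk.lm's MLE (my own shortcut, not necessarily how the notebook does it), fitted on the same Reuters sentences:

from nltk.corpus import reuters
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3
sents = reuters.sents()[:1000]  # a small sample; use more sentences for better estimates

# Build padded everygrams plus the vocabulary, then fit a maximum-likelihood model.
train_ngrams, padded_sents = padded_everygram_pipeline(n, sents)
lm = MLE(n)
lm.fit(train_ngrams, padded_sents)

# P(next word | 'it', 'will')
print(lm.score('be', ['it', 'will']))

# Generate a few words following the prefix.
print(lm.generate(5, text_seed=['it', 'will'], random_seed=42))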

alvas