
I am trying to calculate the perplexity for the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I am receiving the error,

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

I have already performed Latent Dirichlet Allocation on my data and generated the unigrams and their respective probabilities (they are normalized, so the probabilities over the whole data sum to 1).

My unigrams and their probabilities look like:

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The probabilities in the second column sum to 1.

I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code here is from the nltk documentation and I don't know what to do now. Please advise on what I can do. Thanks in advance!

Mazdak
Ana_Sam
  • You first said you want to calculate the perplexity of a unigram model on a text corpus. But now you edited out the word unigram. – Omid Oct 21 '15 at 20:50
  • The sample code from nltk is itself not working :( Here in the sample code it is a trigram and I would change it to a unigram if it works. How to get past this error? – Ana_Sam Oct 21 '15 at 20:52
  • Do you have to use NLTK? – Omid Oct 21 '15 at 20:54
  • Not particular about NLTK. I just felt it was easier to use as am a newbie to programming. Is there any other way or package that I can use to estimate the perplexity for the data (which is not brown corpus) I have? – Ana_Sam Oct 21 '15 at 20:56
  • Of course there is. I am going to assume you have a simple text file from which you want to construct a unigram language model and then compute the perplexity for that model. Right? – Omid Oct 21 '15 at 21:00
  • No, I already performed LDA for the data I had and I have the unigrams generated. That is done. For those unigrams, I need to calculate the perplexity. The format of my data is Negroponte 1.22948976891e-05 Andreas 7.11290670484e-07 Rheinberg 7.08255885794e-07 Joji 4.48481435106e-07; That is, I have the unigrams and their respective normalized distributions. – Ana_Sam Oct 21 '15 at 21:05

2 Answers


Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

PP(W) = P(w1 w2 ... wN)^(-1/N) = ( Π_{i=1..N} 1/P(wi) )^(1/N)

Now you say you have already constructed the unigram model, meaning that for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that provides the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I can adapt my solution accordingly.

perplexity = 1
N = 0

for word in testset:
    if word in unigram:           # words the model has never seen are skipped
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
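
Since your unigrams are already in a file with one `word probability` pair per line, a minimal sketch for turning that file into such a dictionary could look like this (the filename unigrams.txt and the test sentence are just placeholders; the test words are taken from the fragment you posted):

# read "word probability" pairs, e.g. "Negroponte 1.22948976891e-05",
# into the unigram dictionary used above
unigram = {}
with open("unigrams.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, prob = parts
        unigram[word] = float(prob)

# any held-out text you want to evaluate on
testset = "four yellow Sugar".split()

perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
print(perplexity)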

UPDATE:

As you asked for a complete working example, here's a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here's how we construct the unigram model first:

import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

#here you construct the unigram language model
def unigram(tokens):
    # unseen words fall back to a default value of 0.01 (simple smoothing)
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1
    # normalise the counts into probabilities (sum computed once, outside the loop)
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model

Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
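
For example, once the model has been built (as done further down with `model = unigram(tokens)`), looking up a word that never occurred in the corpus falls back to that default value:

model = unigram(tokens)
print(model["Monty"])         # probability estimated from the corpus counts
print(model["abracadabra"])   # 0.01, the default for unseen words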

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test this on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

for which you get the following result:

>>> 
49.09452736318415
99.99999999999997

Note that with perplexity, lower is better: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective perplexity is also smaller.
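
The second value also checks out by hand: none of the three words in testset2 occurs in the corpus, so each contributes the default probability 0.01, giving a perplexity of (1/0.01 × 1/0.01 × 1/0.01)^(1/3) = 100 (printed as 99.99999999999997 because of floating-point rounding).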

Omid
  • Can you please give a sample input for the above code and give its output as well? It will be easier for me to formulate my data accordingly. I have edited the question by adding the unigrams and their probabilities I have in my input file for which the perplexity should be calculated. – Ana_Sam Oct 21 '15 at 21:25
  • Hey! But, I have to include the log likelihood as well like, perplexity (test set) = exp{- (Loglikelihood/count of tokens)} ? http://qpleple.com/perplexity-to-evaluate-topic-models/ – Ana_Sam Oct 21 '15 at 21:40
  • Thank you so much for the time and the code. I will try it out. I have to compute the perplexity for the unigrams that were produced by the LDA model. I guess for the data I have I can use this code and check it out. Thanks a ton! – Ana_Sam Oct 21 '15 at 22:57
  • Isn't there a mistake in the construction of the model in the line `model[word] = model[word]/float(len(model))` - shouldn't that say `model[word] = model[word]/float(sum(model.values()))`? – mknaf Jan 06 '17 at 16:59
  • In the line `model[word]/float(sum(model.values()))`, sum(model.values()) is recomputed each time the normalised model values are updated. Due to this, the normalised values sum not to 1 but to about 3.4; the sum has to be calculated once and used inside the for loop. @heiner was indeed right, I don't see where it was answered. – chmodsss Apr 25 '19 at 12:50

Thanks for the code snippet! Shouldn't:

for word in model:
    model[word] = model[word]/float(sum(model.values()))

be rather:

v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v

Oh ... I see it was already answered ...
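
Still, for anyone who wants to see the difference concretely, here is a small sketch with made-up counts showing that only the precomputed-sum version yields probabilities that sum to 1:

counts = {"a": 2.0, "b": 1.0, "c": 1.0}   # made-up counts

# normalizing with the sum recomputed inside the loop drifts away from 1
wrong = dict(counts)
for word in wrong:
    wrong[word] = wrong[word]/float(sum(wrong.values()))

# normalizing with the sum computed once gives probabilities that sum to 1
right = dict(counts)
v = float(sum(right.values()))
for word in right:
    right[word] = right[word]/v

print(sum(wrong.values()))   # not 1
print(sum(right.values()))   # 1.0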

Heiner
  • Hi Heiner, welcome to SO, as you've already noticed this question has a well received answer from a few years ago, there's no problem with adding more answers to already-answered questions but you may want to make sure they're adding enough value to warrant them, in this case you may want to consider focusing on answering [these new questions](https://stackoverflow.com/questions/tagged/python?sort=newest) instead! – colsw Jan 17 '18 at 16:22