I'm studying compiler construction using Python. I'm trying to create a list of all lowercased words in a text and then build a BigramCollocationFinder, which can be used to find bigrams (pairs of words).

These bigrams are found using association measurement functions in the nltk.metrics package.

I'm practising from the "Python 3 Text Processing with NLTK 3 Cookbook" and I found this example code:

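from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
# Lowercase every token of the sample text
words = [w.lower() for w in webtext.words('grail.txt')]
# Count all adjacent word pairs (bigrams)
bcf = BigramCollocationFinder.from_words(words)
# Score the bigrams with likelihood_ratio and return the 4 best
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)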

I'm stuck at:

bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
likelihood_ratio, 4

Does likelihood_ratio here mean a similarity ratio, or what does it mean in this code?

Any guidance in this matter would be highly appreciated.

Malekai
Mubeen Khan

1 Answer

I believe "NLTK collocations for specific words" should answer your question. nbest scores each bigram with the likelihood-ratio association measure and returns the 4 highest-scoring bigrams in your corpus.
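To make the role of the two arguments concrete, here is a minimal sketch on a made-up word list (the toy sentence is my own assumption, not from the book, so it runs without downloading the webtext corpus). It uses score_ngrams to show the score attached to each bigram, and nbest to show that the second argument only limits how many of the top-scoring bigrams are returned.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy data (an assumption for illustration; the question uses grail.txt)
words = ("the holy grail the holy hand grenade "
         "a black knight the black knight").split()

bcf = BigramCollocationFinder.from_words(words)

# Every bigram paired with its likelihood-ratio score, best first
scored = bcf.score_ngrams(BigramAssocMeasures.likelihood_ratio)

# Only the 2 highest-scoring bigrams, without their scores
top2 = bcf.nbest(BigramAssocMeasures.likelihood_ratio, 2)

for bigram, score in scored[:2]:
    print(bigram, round(score, 2))
print(top2)
```

So in the question's code, likelihood_ratio is the scoring function and 4 is how many bigrams to keep.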

NMAK
  • On what basis does it return them: similarity, or usage? – Mubeen Khan Apr 24 '19 at 15:36
  • The likelihood is based on usage within the corpus. For example, if the bigram "python definition" occurs more often than "python function" in the corpus, the likelihood value of "python definition" will be higher than that of "python function". Section 5.3.4 of https://nlp.stanford.edu/fsnlp/promo/colloc.pdf has more information. https://stackoverflow.com/questions/48715547/how-to-interpret-python-nltk-bigram-likelihood-ratios also explains this, in the answer by John. – NMAK Apr 24 '19 at 15:56
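For anyone who wants to see the "usage basis" spelled out, below is a rough pure-Python sketch of a log-likelihood ratio (the G² statistic) over a 2x2 table of bigram counts. It is not NLTK's own code: the function name bigram_log_likelihood and the toy word list are made up for illustration, and the marginal counts are taken from the bigram list itself, which may differ slightly from how NLTK counts. The idea is that a bigram scores high when it occurs more often than you would expect if its two words were independent.

```python
import math
from collections import Counter

def bigram_log_likelihood(words):
    """Return {bigram: G2 score}; a hypothetical helper, not part of NLTK."""
    bigrams = list(zip(words, words[1:]))
    n = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)    # bigrams starting with w1
    second_counts = Counter(w2 for _, w2 in bigrams)   # bigrams ending with w2
    scores = {}
    for (w1, w2), n_ii in pair_counts.items():
        n_ix, n_xi = first_counts[w1], second_counts[w2]
        # Observed 2x2 table: (w1,w2), (w1,other), (other,w2), (other,other)
        observed = [n_ii, n_ix - n_ii, n_xi - n_ii, n - n_ix - n_xi + n_ii]
        # Counts expected if w1 and w2 occurred independently
        expected = [n_ix * n_xi / n, n_ix * (n - n_xi) / n,
                    (n - n_ix) * n_xi / n, (n - n_ix) * (n - n_xi) / n]
        # Empty cells contribute nothing to the sum
        scores[(w1, w2)] = 2 * sum(o * math.log(o / e)
                                   for o, e in zip(observed, expected) if o > 0)
    return scores

words = "python definition python definition python function the cat sat".split()
scores = bigram_log_likelihood(words)
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 3))
```

On this toy list "python definition" appears twice and "python function" once, and the former gets the higher score, which matches the example in the comment above.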