
I am seeing multiple questions and answers saying that NLTK collocations cannot be computed beyond bigrams and trigrams.

For example, this one: How to get n-gram collocations and association in python nltk?

However, I can see that there is something called

nltk.QuadgramCollocationFinder

Similar to

nltk.BigramCollocationFinder and nltk.TrigramCollocationFinder

But at the same time I cannot see anything like

nltk.collocations.QuadgramAssocMeasures()

similar to nltk.collocations.BigramAssocMeasures() and nltk.collocations.TrigramAssocMeasures()

What is the purpose of nltk.QuadgramCollocationFinder if it's not possible (without hacks) to find collocations beyond bigrams and trigrams?

Maybe I am missing something.

Thanks,

Update: adding the code and updating the question following input from alvas. This now works:

import nltk
from nltk.collocations import *
from nltk.corpus import PlaintextCorpusReader
from nltk.metrics.association import QuadgramAssocMeasures

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
quadgram_measures = QuadgramAssocMeasures()

# apply_ngram_filter removes n-grams for which the filter returns True,
# so this keeps only n-grams that contain the word 'crazy'
the_filter = lambda *w: 'crazy' not in w

# `corpus` is the sequence of word tokens to analyse (loading it is not shown here)
finder = BigramCollocationFinder.from_words(corpus)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
finder.apply_ngram_filter(the_filter)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

# Same pipeline for quadgrams
finder = QuadgramCollocationFinder.from_words(corpus)
finder.apply_freq_filter(3)
finder.apply_ngram_filter(the_filter)
print(finder.nbest(quadgram_measures.likelihood_ratio, 10))
Kumar
  • Update your NLTK with `pip install -U nltk`; you should then be able to get QuadgramAssocMeasures with `from nltk.metrics.association import QuadgramAssocMeasures` https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py#L298 – alvas Dec 11 '15 at 19:45
  • Thanks a lot, Sire! This now works. I did not have to do the pip install, so I assume I already had a recent version. Why is everyone saying that nothing beyond trigrams works, though? Maybe NLTK was updated with quadgrams after those other Stack Overflow questions were answered? – Kumar Dec 12 '15 at 00:37
  • Sire is a little too much for me, calling me `alvas` would do ;P . Yes, NLTK has been hugely improved in the past 2-3 years. The `QuadgramCollocationFinder` and the `QuadgramAssocMeasures` are somewhat new. But what the other answer from http://stackoverflow.com/questions/18672082/how-to-get-n-gram-collocations-and-association-in-python-nltk is trying to say is that there is no simple way to implement a general NgramCollocationFinder: the formula in the `from_words(cls, words)` function is different for every order of ngram. – alvas Dec 12 '15 at 01:19
  • Take a look at the contingency table from trigram: https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py#L264 and now take a look at the quadgram: https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py#L321 As the order of ngram increases, the contingency table becomes more complex. And so does the marginal table: https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py#L350 – alvas Dec 12 '15 at 01:21
  • ok Sire Alvas -;) will call you just Alvas going forward...I will take a look at the github – Kumar Dec 13 '15 at 16:16
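
To make the point in the comments about growing contingency tables concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation (NLTK derives the cells from marginal counts in `nltk.metrics.association`); the helper name `ngram_contingency` and the toy sentence are made up for illustration. It counts, by brute force, every match/mismatch pattern of a target n-gram, which shows that an n-gram needs 2**n cells: 4 for bigrams, 8 for trigrams, 16 for quadgrams.

```python
from collections import Counter
from itertools import product


def ngram_contingency(tokens, target):
    """Brute-force contingency table for `target` over the n-gram windows of `tokens`.

    Each cell is keyed by a tuple of booleans: position i is True when the
    window's i-th word equals the target's i-th word. (Hypothetical helper,
    not part of NLTK.)
    """
    n = len(target)
    cells = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        pattern = tuple(w == t for w, t in zip(window, target))
        cells[pattern] += 1
    # List all 2**n cells explicitly, including the empty ones.
    return {p: cells.get(p, 0) for p in product((True, False), repeat=n)}


tokens = "the cat sat on the mat and the cat sat on the rug".split()
targets = [("the", "cat"), ("the", "cat", "sat"), ("the", "cat", "sat", "on")]
tables = {target: ngram_contingency(tokens, target) for target in targets}

for target, table in tables.items():
    full_match = table[(True,) * len(target)]
    print(len(target), "words:", len(table), "cells, full matches:", full_match)
```

Each association measure has to combine all of these cells, which is why a separate measures class exists per n-gram order rather than one generic finder.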

1 Answer


From the repo:

from nltk.metrics.association import QuadgramAssocMeasures
alvas
  • Thanks, this works as per your suggestion. Could you please let me know why the bigram and trigram measures are part of nltk.collocations, while QuadgramAssocMeasures has to be imported from nltk.metrics.association? – Kumar Dec 12 '15 at 00:30
  • You can find `BigramAssocMeasures` in `nltk.collocations` because of the import at https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L39. The true location of `BigramAssocMeasures` is actually `nltk.metrics.association`. So it's sort of a feature, not a bug. – alvas Dec 12 '15 at 01:25
  • No worries, in 1-2 weeks `QuadgramAssocMeasures` should be added to `nltk.collocations` too. There are other, more important bugs to fix =) – alvas Dec 12 '15 at 01:26
  • So once QuadgramAssocMeasures moves to nltk.collocations, will it still work the way it does today? I am guessing the current import gets deprecated then? Thanks again. – Kumar Dec 13 '15 at 16:17
  • BTW, I just observed that the tokenization seems to be splitting off punctuation too, so I see ("'", 's', 'really', 'helpful') instead of ("that's", 'really', 'helpful', 'information') – Kumar Dec 13 '15 at 16:21
  • The `from nltk.metrics.association import QuadgramAssocMeasures` code will still work after QuadgramAssocMeasures gets added to `nltk.collocations`. It will just be some namespace manipulation, so nothing in NLTK will get deprecated with respect to collocations/association measures. – alvas Dec 13 '15 at 16:22
  • Another example is ('you', "'", 're', 'very') instead of ("you're", 'very', 'helpful', 'person'). One other thing: I thought stopwords were removed automatically somewhere within the code, but I saw results like ('a', 'helpful') – Kumar Dec 13 '15 at 16:25
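
On the last two comments: the collocation finders work on whatever tokens they are given and do not remove stopwords on their own, so both issues need handling before or around the finder. The split contractions most likely come from the tokenizer that produced `corpus` rather than from the finder. Below is a minimal plain-Python sketch, assuming a simple regular expression is good enough for the text at hand; `TOKEN_RE`, `tokenize`, `is_stopword`, and the tiny `STOPWORDS` set are hypothetical names and data chosen for illustration, not NLTK API.

```python
import re

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "to"}

# A run of letters, optionally followed by apostrophe + letters,
# so "that's" and "you're" stay in one piece and punctuation is dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Lowercased word tokens with contractions kept intact."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def is_stopword(word):
    """True for words that should not take part in a collocation."""
    return word.lower() in STOPWORDS


text = "That's really helpful information. You're a very helpful person!"
tokens = tokenize(text)
print(tokens)

content_tokens = [t for t in tokens if not is_stopword(t)]
print(content_tokens)
```

With NLTK, a token list like `tokens` can be passed to `QuadgramCollocationFinder.from_words(...)`, and `finder.apply_word_filter(is_stopword)` drops every candidate n-gram that contains a stopword, which keeps the original token order intact for counting.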