nltk: how to get bigrams containing a specific word

Question

I am new to nltk and would like to get the collocates of a specific word (e.g. "man"), so that I can later filter them by frequency and sort them by PMI score.

Here is my attempt at retrieving the bigrams that contain "man", but it returns an empty list:

>>> from nltk import word_tokenize
>>> from nltk.collocations import BigramCollocationFinder
>>> text = "hello, yesterday I have seen a man walking. On the other side there was another man yelling \"who are you, man?\""
>>> tokens = word_tokenize(text)
>>> finder = BigramCollocationFinder.from_words(tokens, window_size=5)
>>> filter_man = lambda w: "man" not in w
>>> finder.apply_word_filter(filter_man)
>>> finder.ngram_fd.items()
[(('have', 'seen'), 1), ((',', 'yesterday'), 1), (('on', 'the'), 1), (('I', 'have'), 1), (('of', 'another'), 1), (('walking', 'on'), 1), (('seen', 'a'), 1), (('hello', ','), 1), (('man', 'walking'), 1), (('side', 'of'), 1), (('the', 'opposite'), 1), (('a', 'man'), 1), (('opposite', 'side'), 1), (('another', 'man'), 1), (('yesterday', 'I'), 1)]
>>> finder.ngram_fd.items()
[]
>>>

What am I doing wrong?

Possible duplicate of [NLTK collocations for specific words](https://stackoverflow.com/questions/21165702/nltk-collocations-for-specific-words) — Georgy, Mar 09 '18 at 21:00

Moris Huxley · Accepted Answer · 2018-03-09T21:04:40.760

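import nltk
from nltk.collocations import BigramCollocationFinder

# `text` is the string from the question
finder = BigramCollocationFinder.from_words(text.split())
# An ngram filter removes every bigram for which it returns True,
# so this keeps only the bigrams that contain "man".
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)

bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10)  # top 10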
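
A rough sketch of why the question's filter empties the finder, and of the frequency filtering and PMI ranking the question asks about. `apply_word_filter` drops a bigram as soon as any one of its words makes the filter return True, and every bigram here has at least one word that is not "man", so nothing survives; `apply_ngram_filter` receives both words at once. The regex tokenizer, the `bigrams_with` helper and its `min_freq` parameter are illustrative names, not part of nltk or of the answer; the tokenizer is only a stand-in for `word_tokenize` so that no nltk data download is needed.

```python
import re

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

TEXT = ('hello, yesterday I have seen a man walking. On the other side '
        'there was another man yelling "who are you, man?"')


def tokenize(text):
    # Assumption: lowercase word tokens are good enough for this demo.
    return re.findall(r"\w+", text.lower())


def bigrams_with(tokens, target, min_freq=1):
    """Return (bigram, PMI) pairs for bigrams containing `target`, best first."""
    finder = BigramCollocationFinder.from_words(tokens)
    # Remove every bigram in which neither word is the target.
    finder.apply_ngram_filter(lambda w1, w2: target not in (w1, w2))
    # Remove bigrams seen fewer than `min_freq` times.
    finder.apply_freq_filter(min_freq)
    # score_ngrams returns the pairs sorted by descending score.
    return finder.score_ngrams(BigramAssocMeasures.pmi)


tokens = tokenize(TEXT)

# The question's approach: a word filter is applied to each word separately.
word_filtered = BigramCollocationFinder.from_words(tokens)
word_filtered.apply_word_filter(lambda w: "man" not in w)
remaining_after_word_filter = len(word_filtered.ngram_fd)
print("bigrams left after the word filter:", remaining_after_word_filter)

# The ngram-filter approach keeps the bigrams that contain "man".
scored = bigrams_with(tokens, "man")
for bigram, score in scored:
    print(bigram, round(score, 3))
```

On a text this small every surviving bigram occurs only once, so raising `min_freq` above 1 removes them all; on a real corpus a higher threshold is the usual way to keep PMI from favouring very rare pairs.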

edited Mar 09 '18 at 21:04

answered Mar 09 '18 at 16:18

@ThanksBye nltk.collocations only provides Bigram and Trigram finders. – Moris Huxley Mar 09 '18 at 17:16
@MorisHuxley there is a Quadgram finder as well (and it is pretty straightforward to fork it and implement 5-grams etc.) http://www.nltk.org/_modules/nltk/collocations.html – Alex Parakhnevich Mar 11 '18 at 13:33

@ThanksBye btw, I am not sure you actually need collocation finders for your use case; there is a chance that the `nltk.util.ngrams` function would suit you just fine, so check it out – Alex Parakhnevich Mar 11 '18 at 13:34
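
A minimal sketch of the `nltk.util.ngrams` route mentioned in the last comment, assuming plain counts are enough and no association scores are needed. The `ngrams_containing` helper and the sample sentence are made up for illustration; only `ngrams` comes from nltk.

```python
from collections import Counter

from nltk.util import ngrams

# Hypothetical sample input, already tokenized by whitespace.
tokens = "a man saw a man near the old man".split()


def ngrams_containing(tokens, target, n=2):
    """Count the n-grams of `tokens` that contain `target`."""
    return Counter(gram for gram in ngrams(tokens, n) if target in gram)


counts = ngrams_containing(tokens, "man")
for gram, freq in counts.most_common():
    print(gram, freq)
```

Because `n` is a parameter, the same helper covers trigrams or longer n-grams without needing a dedicated finder class.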
