How can I find a specific bigram using nltk in python?

Question

I am currently working with nltk.book in Python and would like to find the frequency of a specific bigram. I know there is the bigrams() function, which returns all the bigrams in a text, as in this code:

    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>>

But what if I was searching for only a specific one like "wish for"? I couldn't find anything about that in the nltk documentation so far.

So you want the frequency of "wish for" ? Please add the expected output — Dani Mesejo, Nov 14 '20 at 15:19
@DaniMesejo Yes, the output should be something like "Wish for: 5". However, my question is now solved. :) — Jennifer, Nov 14 '20 at 16:00

Alexander L. Hayes · Accepted Answer · 2020-11-14T15:29:47.610

If you have the bigrams as a list of tuples, you can test for a specific one with the in operator:

    >>> bgrms = [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>> ('more', 'is') in bgrms
    True
    >>> ('wish', 'for') in bgrms
    False

Then if you're looking for the frequency of specific bigrams, it might be helpful to build a Counter:

    from collections import Counter

    from nltk import bigrams

    # Build every consecutive word pair from the token list
    bgrms = list(bigrams(['more', 'is', 'said', 'than', 'wish', 'for', 'wish', 'for']))

    # Count how often each bigram occurs
    bgrm_counter = Counter(bgrms)

    # Query the Counter for a specific bigram; indexing returns 0 if it never occurs
    print(bgrm_counter[('wish', 'for')])

Output:

    2

Finally, if you want this frequency relative to all the bigrams in the text, divide the count by the total number of bigrams:

    # Divide by the length of `bgrms` (the total number of bigrams in the text)
    print(bgrm_counter[('wish', 'for')] / len(bgrms))

Output:

    0.2857142857142857
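
The two steps above can be folded into one helper. The following is only a minimal sketch: the name `bigram_frequency` is hypothetical, and it pairs tokens with `zip` rather than `nltk.bigrams` so that it runs without NLTK installed (both approaches produce the same consecutive pairs). It also lowercases the tokens, on the assumption that "Wish for" and "wish for" should be counted together.

```python
from collections import Counter


def bigram_frequency(tokens, first, second):
    """Return (count, relative frequency) of the bigram (first, second)."""
    # Lowercase so that "Wish for" and "wish for" are counted together.
    lowered = [token.lower() for token in tokens]
    # Pair each token with its successor, giving every consecutive word pair.
    pairs = list(zip(lowered, lowered[1:]))
    counts = Counter(pairs)
    # Indexing a Counter returns 0 for a missing key, whereas .get() returns None.
    count = counts[(first.lower(), second.lower())]
    # Guard against texts with fewer than two tokens (no bigrams at all).
    relative = count / len(pairs) if pairs else 0.0
    return count, relative


# Hypothetical sample tokens, not taken from nltk.book
tokens = ['I', 'wish', 'for', 'more', 'than', 'you', 'Wish', 'for']
count, relative = bigram_frequency(tokens, 'wish', 'for')
print(f"Wish for: {count}")
```

Any list of word tokens can be passed in place of the sample list. Dropping the lowercasing step makes the count case-sensitive, which matches what the Counter in the answer above does.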
