I'm trying to retrieve a list of bigrams with a specific frequency (i).
I've managed to come up with two ways to do it, and I am wondering which would be more efficient. I first create a list of bigrams, bg1, then build a frequency distribution from it with nltk.FreqDist:
import nltk
from nltk import FreqDist
from nltk import bigrams
#setup data
from nltk.book import text1
#keep only alpha words / remove punctuation
alphlist = [w for w in list(text1) if w.isalpha()]
#create bigrams list
bg1 = bigrams(alphlist)
#create freqdist object
fdist1 = nltk.FreqDist(bg1)
Approach one sorts first with most_common, then filters:
for obj in fdist1.most_common():
    if obj[1] == i:
        print(obj)
Approach two iterates over fdist1 directly:
for obj in fdist1:
    if fdist1[obj] == i:
        print(obj, fdist1[obj])
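For anyone who wants to try this without downloading the NLTK book corpus, here is a minimal sketch of the same two approaches. It rests on a few assumptions that are not in my real code: a toy token list replaces text1, zip replaces nltk.bigrams, and collections.Counter stands in for FreqDist (which subclasses Counter, so most_common and items behave the same way).

```python
from collections import Counter

# Toy tokens standing in for the filtered text1 words (assumption for illustration).
tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Adjacent word pairs; zip is used here as a stand-in for nltk.bigrams.
pairs = list(zip(tokens, tokens[1:]))

# Counter stands in for FreqDist, which subclasses it.
fdist = Counter(pairs)

i = 2  # target frequency

# Approach one: sort every bigram by count, then keep the matches.
sorted_hits = [(bg, n) for bg, n in fdist.most_common() if n == i]

# Approach two: single pass over the counts, no sort.
direct_hits = [(bg, n) for bg, n in fdist.items() if n == i]

print(sorted_hits)
print(direct_hits)
```

As far as I can tell, both lists contain the same bigrams. Approach one sorts the whole distribution before filtering, while approach two only makes one pass over it.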
Which approach is better and why?