I'm trying to retrieve all bigrams that occur with a specific frequency (i).

I've come up with two ways to do it, and I'm wondering which is more efficient. I first create the bigrams bg1, then build an nltk.FreqDist from them:

import nltk
from nltk import bigrams

# set up data
from nltk.book import text1

# keep only alphabetic words / remove punctuation
alphlist = [w for w in list(text1) if w.isalpha()]
# create bigrams (in NLTK 3 this is a generator, not a list)
bg1 = bigrams(alphlist)

# create FreqDist object mapping each bigram to its count
fdist1 = nltk.FreqDist(bg1)

Approach one uses the most_common sort first:

for obj in fdist1.most_common():
  if obj[1] == i:
    print(obj)

Approach two parses fdist1 directly:

for obj in fdist1:
  if fdist1[obj] == i:
    print(obj, fdist1[obj]) 

Which approach is better and why?
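
For reference, here is a minimal, self-contained sketch of the two approaches side by side. It makes a few assumptions that are not in the code above: a toy token list stands in for text1, collections.Counter stands in for nltk.FreqDist (in NLTK 3, FreqDist subclasses Counter, so most_common() and items() should behave the same way), and i = 2 is an arbitrary example value.

```python
from collections import Counter

# Toy tokens standing in for the filtered words of text1 (assumption).
tokens = "the cat sat on the mat and the cat sat down".split()

# Pair each token with its successor, mirroring what nltk.bigrams yields.
fdist = Counter(zip(tokens, tokens[1:]))

i = 2  # target frequency (hypothetical example value)

# Approach one: sort every (bigram, count) pair by count, then filter.
via_sort = [(bg, n) for bg, n in fdist.most_common() if n == i]

# Approach two: filter in a single pass, with no sorting.
via_scan = [(bg, n) for bg, n in fdist.items() if n == i]

print(via_sort)
print(via_scan)
```

Both lists contain the same bigrams. The only difference is the extra sort in approach one, which is thrown away because every match has the same count anyway.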

jonrsharpe
Axle Max
  • Please try not to wipe out existing edits when adding more information to the question – jonrsharpe Oct 20 '16 at 21:41
  • Hi Jon. Apologies if I did that. I am completely new here. I honestly didn't know you were editing. Is there some way of knowing that someone else is editing at the same time as me? I'll do whatever is the right thing once I know how. :-) – Axle Max Oct 20 '16 at 21:44
  • #2 should be most efficient, since sorting is O(n log n) and simply inspecting the elements directly is O(n). With that said, you're in the best position to answer this since you can time both methods while we can only theorize. See http://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python – Sohier Dane Oct 20 '16 at 21:46
