I have thousands of phone calls converted from speech to text every day. I tried generating collocation data using the two options below.
OPTION # 1
# corpus is an nltk.Text; prints the top 200 collocations using a window size of 2
corpus.collocations(200, 2)
OPTION # 2
import nltk
from nltk.collocations import BigramCollocationFinder

bigram = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(corpus)
finder.apply_freq_filter(5)                 # drop bigrams that occur fewer than 5 times
my_bigrams = finder.nbest(bigram.pmi, 200)  # top 200 bigrams ranked by PMI
When I use option #1 I seem to get good data, but the terms are not very meaningful. For example, I get terms like "good morning", "good afternoon", and "american express": they are important terms, but far too common in the phone calls.
Option #2 seems to give better data; for example, it returns car makes and models, names of cities, and so on.
I was wondering if someone has already used both of these options, decided to go one route or the other, and if so, on what basis.
I do see some data from option 1 that might be good, so I am thinking of generating data using both options and merging the results, roughly as in the sketch below.
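Something like this is what I have in mind for the merge (assuming tokens is the flat list of transcribed words, and an NLTK version where Text.collocation_list is available to return the pairs instead of printing them):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

corpus = nltk.Text(tokens)            # tokens = flat list of transcribed words

# option 1, but as a list of pairs instead of printed output
option1 = corpus.collocation_list(200, 2)

# option 2: PMI-ranked bigrams seen at least 5 times
bigram = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)
option2 = finder.nbest(bigram.pmi, 200)

# keep everything from option 2 and add whatever option 1 found on top of it
merged = option2 + [pair for pair in option1 if pair not in option2]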
Any thoughts, please?
*Editing my question a bit more: based on what I have seen so far, I will mostly end up keeping most of the results from option 2 and merging them with some from option 1. I am wondering if someone can also shed some light on how the two work differently.
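For reference, my current reading of the NLTK source is that option 1 does roughly what the snippet below does: it drops stopwords and short words, keeps bigrams seen at least twice, and ranks by likelihood ratio, which rewards pairs that are simply frequent together, whereas the PMI score in option 2 rewards pairs whose words rarely occur apart. That would explain why option 1 surfaces "good morning" while option 2 surfaces car models and city names, but I would appreciate confirmation or correction:

from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

ignored = set(stopwords.words('english'))

# corpus is the nltk.Text from option 1; .tokens is its underlying word list
finder = BigramCollocationFinder.from_words(corpus.tokens, window_size=2)
finder.apply_freq_filter(2)                                    # keep bigrams seen at least twice
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored)
top200 = finder.nbest(BigramAssocMeasures().likelihood_ratio, 200)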