
I have thousands of phone calls converted from speech to text on a daily basis. I tried generating collocation data using the two options below.

OPTION #1

corpus.collocations(200,2)

OPTION #2

import nltk
from nltk.collocations import BigramCollocationFinder

bigram = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(corpus)
finder.apply_freq_filter(5)                  # drop bigrams seen fewer than 5 times
my_bigrams = finder.nbest(bigram.pmi, 200)   # top 200 bigrams ranked by PMI

When I use option #1 I seem to get good data, but the terms are not very meaningful. For example, I get terms like "good morning", "good afternoon", "american express". They are important terms, but way too common in the phone calls.

Option #2 seems to give better data. For example, it returns car makes and models, names of cities, etc.
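One reason option #2 needs apply_freq_filter: PMI is strongly biased toward very rare pairs, so without a minimum count the top of the list tends to be one-off transcription errors that co-occur exactly once. A minimal sketch with invented toy tokens (not real call data):

```python
# Sketch: PMI without vs. with a frequency filter on invented tokens.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = (["good", "morning"] * 40   # very frequent pair
          + ["xzq", "qqw"]           # one-off "transcription error" pair
          + ["call", "today"] * 5)   # moderately frequent, exclusive pair

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

unfiltered = finder.nbest(measures.pmi, 1)  # the hapax pair wins on raw PMI
finder.apply_freq_filter(2)                 # drop bigrams seen fewer than 2 times
filtered = finder.nbest(measures.pmi, 1)    # now a real phrase surfaces
```

Raising the filter threshold (5 in the question) trades recall of rare phrases for robustness against noise.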

I was wondering if someone has already used both of these options and decided to go one route or the other, and if so, on what basis.

I do see some data from option #1 that might be good, so I am thinking of generating data using both options.

Any thoughts please ?

*Editing my question a bit more: based on what I have seen so far, I will mostly end up taking most of the results from option #2 and merging in some from option #1. I am wondering if someone can also shed some light on how the two work differently.
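On how they differ: in NLTK, Text.collocations() ranks bigrams by the likelihood-ratio measure (after a stopword/length filter and a built-in frequency filter), which rewards pairs that are both strongly associated and frequent, hence "good morning". PMI instead rewards pairs that almost never occur apart, even when rare, hence car makes and city names. A minimal sketch of the two rankings on the same invented token stream:

```python
# Sketch: likelihood ratio vs. PMI on the same toy tokens (invented data).
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = (["good", "morning"] * 50   # very frequent pair
          + ["honda", "accord"] * 3  # rare but exclusive pair
          + ["the", "call"] * 30)    # frequent filler

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Likelihood ratio (the measure Text.collocations uses) favors frequent pairs.
by_lr = finder.nbest(measures.likelihood_ratio, 3)
# PMI favors pairs that rarely occur apart, even with low counts.
by_pmi = finder.nbest(measures.pmi, 3)
```

Here the frequent "good morning" pair tops the likelihood-ratio list, while the rare "honda accord" pair tops the PMI list, matching what the two options produce on the call transcripts.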

Naresh MG
  • Take a look at tf-idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf – alvas Jul 20 '16 at 01:22
  • Would it be possible to share the dataset? Otherwise, it would be hard to know which method is more appropriate for you. – alvas Jul 20 '16 at 01:23
  • @alvas I am afraid I cannot share... but looking at the link you sent, it looks like option #2 already takes into consideration tf-idf-like weighting (pointwise mutual information) and hence is getting better data. Thank you for providing the link. – Naresh MG Jul 20 '16 at 21:58
