
I am trying to get a phrase count from a text file but so far I am only able to obtain a word count (see below). I need to extend this logic to count the number of times a two-word phrase appears in the text file.

From my understanding, phrases can be defined/grouped using logic from NLTK. I believe the collocations module is what I need to obtain the desired result, but I'm not sure how to go about implementing it from reading the NLTK documentation. Any tips/help would be greatly appreciated.

import re

frequency = {}
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
# Match words of 3-15 lowercase letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

for word in frequency:
    print(word, frequency[word])
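One way to extend the word-count loop above to two-word phrases, using only the standard library, is to pair each word with the word that follows it (the sample word list here is illustrative):

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "cat", "sat"]

# Pair each word with the one that follows it:
# ("the", "cat"), ("cat", "sat"), ("sat", "on"), ...
phrase_counts = Counter(zip(words, words[1:]))

for phrase, count in phrase_counts.most_common():
    print(" ".join(phrase), count)
```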
bhat557
  • Are you looking for two specific words? Or just any two word phrases that appear together? – accraze Sep 25 '16 at 22:06
  • Any two words that appear together – bhat557 Sep 25 '16 at 22:31
    Are you looking for [nltk.bigrams()](http://www.nltk.org/api/nltk.html#nltk.util.bigrams)? – alexis Sep 25 '16 at 22:56
  • Yes, Could I do something like: import nltk from nltk.collocations import * bigram_measures = nltk.collocations.BigramAssocMeasures() trigram_measures = nltk.collocations.TrigramAssocMeasures() finder = BigramCollocationFinder.from_words( nltk.corpus.genesis.words('Words.txt'))? – bhat557 Sep 25 '16 at 23:18

2 Answers


You can find two-word phrases using the collocations module, which identifies words that often appear consecutively within a corpus.

To find two-word phrases you first need to calculate the frequencies of words and of their appearances in the context of other words. NLTK has a `BigramCollocationFinder` class that can do this. Here's how to find the bigram collocations:

import re
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Build the finder from the token list, then rank bigrams by PMI
finder = BigramCollocationFinder.from_words(match_pattern)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 2))

NLTK Collocations Docs: http://www.nltk.org/api/nltk.html?highlight=collocation#module-nltk.collocations
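Note that `nbest(bigram_measures.pmi, 2)` returns the two pairs with the highest PMI score, not their counts. If raw phrase counts are what you're after, the finder's `ngram_fd` frequency distribution holds them (the sample word list here is illustrative):

```python
from nltk.collocations import BigramCollocationFinder

words = ["big", "cat", "big", "cat", "small", "dog"]
finder = BigramCollocationFinder.from_words(words)

# ngram_fd is an NLTK FreqDist mapping each bigram to its raw count
for bigram, count in finder.ngram_fd.most_common():
    print(bigram, count)
```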

accraze
  • Thank you! When I try to pass my txt file into the finder function though, it simply prints out "[('W', 'o'), ('d', 's')]". Is there something I need to do to my txt file before passing it into the finder? Wasn't clear from the documentation. – bhat557 Sep 26 '16 at 00:05
  • I updated the code in my answer, I believe you would need to pass `match_pattern` to the finder instead – accraze Sep 26 '16 at 04:22

`nltk.bigrams` returns the pairs of consecutive words in a text; counting them gives each pair's frequency in the text. Try this:

import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize

document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
tokens = word_tokenize(text_string)
# Count how often each consecutive word pair occurs
result = nltk.FreqDist(bigrams(tokens)).most_common()
print(result)

Output:

[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]
estebanpdl