
I have been examining different sources on the web and have tried various methods, but I could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:

import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I appreciate any suggestion or guidance you can give.

asked by Raul (edited by falsetru)
  • do you want to find frequencies of [word bigrams](http://en.wikipedia.org/wiki/N-gram) in a text? – jfs Nov 12 '13 at 04:27
  • @J.F. Sebastian, in a way, but specific ones, like the frequency of a phrase such as "high inflation rate." – Raul Nov 12 '13 at 07:01
  • related: [What are ngram counts and how to implement using nltk?](http://stackoverflow.com/q/12821201/4279) – jfs Nov 12 '13 at 13:15

3 Answers


First of all, this is how I would generate the `cnt` that you do (to reduce memory overhead):

import collections
import re

def findWords(filepath):
  with open(filepath) as infile:
    for line in infile:
      # stream the words one line at a time instead of reading the whole file
      words = re.findall(r'\w+', line.lower())
      yield from words

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))

Now, on to your question about phrases:

from itertools import tee

phrases = {'central bank', 'high inflation'}
# two copies of the word stream; advancing the second by one word
# lines them up as overlapping (w1, w2) pairs
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    cnt[phrase] += 1
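
To see why the `tee`/`next` combination produces overlapping pairs, here is a tiny standalone demo (the word list is made up for illustration):

from itertools import tee

ws = ['the', 'central', 'bank', 'raised', 'rates']
a, b = tee(ws)      # two independent iterators over the same words
next(b)             # shift the second iterator forward by one word
print(list(zip(a, b)))
# [('the', 'central'), ('central', 'bank'), ('bank', 'raised'), ('raised', 'rates')]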

Hope this helps.

answered by inspectorG4dget (edited by John La Rooy)
  • In python 3.3, you can use `yield from`. – falsetru Nov 12 '13 at 04:16
  • `phrase` becomes `True` or `False`, so `phrase in phrases` always yields `False`. – falsetru Nov 12 '13 at 04:17
  • This code does not yield what the OP wants. Try the code with `the central bank high inflation` as the contents of the file, and with `central bank high inflation`. You may need to use something like `itertools.tee`. See the `pairwise` recipe from [`itertools` recipes](http://docs.python.org/2/library/itertools.html#recipes) (reproduced after these comments). – falsetru Nov 12 '13 at 04:18
  • @falsetru: thanks for the bug report and the "`yield from`" comment. Please let me know if the changes help – inspectorG4dget Nov 12 '13 at 04:19
  • You're treating the first and second words as a phrase, and the third and fourth etc, but not the second and third – John La Rooy Nov 12 '13 at 04:25
  • @gnibbler: you're right! How did I miss that?! It's fixed now – inspectorG4dget Nov 12 '13 at 04:29
  • Now you're processing the file twice. Let me edit it to use `tee` – John La Rooy Nov 12 '13 at 04:34
  • Hey you guys, thanks for the quick responses. The code so far, however, counts the frequencies of all words and not of the phrases. I've checked, and it's based on the first snippet of code (the `findWords` generator feeding `collections.Counter`). I'll try and see if an edit for that one works for the specific words – Raul Nov 12 '13 at 06:53
  • @Raul: there's two snippets of code. The second one deals with phrases – inspectorG4dget Nov 12 '13 at 06:56
  • @inspectorG4dget, but the second snippet depends on the first, and for some reason it still registers only words, not phrases. Is it perhaps a glitch in version 3.3.2? – Raul Nov 12 '13 at 07:00
  • @Raul: are you sure? The second snippet should do both words and phrases – inspectorG4dget Nov 12 '13 at 07:06
  • @inspectorG4dget Unfortunately, yes, I've connected it to the first snippet and it returns the frequency of all the words in the text file. Is there a way to isolate the result to just the phrases? Thanks for all of the help. – Raul Nov 12 '13 at 07:29
  • @Raul: yeah, just replace `cnt = ...` with `cnt = collections.Counter()` in the first snippet – inspectorG4dget Nov 12 '13 at 07:35
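
For reference, the `pairwise` recipe falsetru points to (from the itertools documentation, shown here in its Python 3 form) looks like this:

from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)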

To count literal occurrences of a couple of phrases in a small file:

with open("input_text.txt") as file:
    text = file.read()
n = text.count("high inflation rate")
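
Note that `str.count` counts non-overlapping occurrences. To tally several phrases at once, the same idea extends naturally (the phrase list here is just an example):

phrases = ['central bank', 'high inflation']
counts = {phrase: text.count(phrase) for phrase in phrases}
print(counts)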

There is the `nltk.collocations` module, which provides tools to identify words that often appear consecutively:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# run nltk.download() if there are files missing
words = [word.casefold() for sentence in sent_tokenize(text)
         for word in word_tokenize(sentence)]
words_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))
finder = BigramCollocationFinder(words_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
print(finder.score_ngrams(bigram_measures.raw_freq))

# finder can be constructed from words directly
finder = TrigramCollocationFinder.from_words(words)
# keep only n-grams whose words all appear in the question's `wanted` set
finder.apply_word_filter(lambda w: w not in wanted)
# top n results
trigram_measures = nltk.collocations.TrigramAssocMeasures()
print(sorted(finder.nbest(trigram_measures.raw_freq, 2)))
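
If only the counts of specific phrases are needed, the bigram frequency distribution built above can be queried directly, since `nltk.FreqDist` behaves like a `Counter` (a sketch using the phrases from the question):

print(bigram_fd[('central', 'bank')])
print(bigram_fd[('high', 'inflation')])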
answered by jfs (edited by Costa)

Assuming the file is not huge, this is the easiest way:

# `words` and `cnt` come from the code in the question; the two-word
# phrases need to be added to `wanted` for this check to match them
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in wanted:
        cnt[phrase] += 1
print(cnt)
answered by John La Rooy
  • Hey gnibbler, thanks for all of the great insight! However, when I incorporate this part of the code with the first snippet above, it returns an error message indicating that 'words' is not recognized. Do you happen to know why that is? Thanks again for all of the help. – Raul Nov 12 '13 at 07:31
  • `words` is just the list of words from your question. The `for` loop combines pairs of words to create (two word) phrases – John La Rooy Nov 12 '13 at 09:27
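
For completeness, a sketch that combines the question's setup with this answer (note that adding the two-word phrases to `wanted` is an assumption beyond the original code):

import collections
import re

wanted = {'inflation', 'gold', 'bank', 'central bank', 'high inflation'}
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())

# count single words
for word in words:
    if word in wanted:
        cnt[word] += 1

# count two-word phrases by pairing each word with its successor
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + ' ' + w2
    if phrase in wanted:
        cnt[phrase] += 1

print(cnt)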