I would like to count the frequency of the three words preceding and following a specific word in a text file that has been converted into tokens.
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter

# Read the whole book, join the lines and lowercase everything
with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = word_tokenize(text_data)
text = nltk.Text(tokens)

# Build 4-grams and count how often each one occurs
grams = ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)
I don't know how to filter the n-grams for the string 'dracula'. I also tried:
text.collocations(num=100)
text.concordance('dracula')
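I imagine the filtering would look roughly like the sketch below, keeping only the 4-grams where 'dracula' is the last or the first token, but I am not sure this is the right approach:
# Rough idea (untested): filter the 4-grams by the position of 'dracula'
grams = list(ngrams(tokens, 4))
preceding = Counter(g for g in grams if g[3] == 'dracula')   # three words before 'dracula'
following = Counter(g for g in grams if g[0] == 'dracula')   # three words after 'dracula'
preceding.most_common(20)
following.most_common(20)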
The desired output would look something like this, with counts:
Three words preceding 'dracula', sorted by count
(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)
Three words following 'dracula', sorted by count
(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)
Trigrams containing 'dracula' in the middle, sorted by count
(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)
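For the middle case I am guessing I would need trigrams instead of 4-grams and then check the middle position, something like this (again, untested):
# Guess for the middle case: build trigrams and keep the ones
# whose middle token is 'dracula'
trigrams = ngrams(tokens, 3)
middle = Counter(g for g in trigrams if g[1] == 'dracula')
middle.most_common(20)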
Thank you in advance for any help.