Defaultdict(defaultdict) for text analysis

Question

Text read from file and cleaned up:

['the', 'cat', 'chased', 'the', 'dog', 'fled']

The challenge is to return a dict with each word as the value and the words that can follow it as the key and a count for the number of times it follows it:

{'the': {'cat': 1, 'dog': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}, 'dog': {'fled': 1}}

Collections.counter will count the frequency of each unique value. However, my algorithm to solve this challenge is long and unwieldy. How might defaultdict be used to make solving this more simple?

EDIT: here is my code to bruise through this problem. A flaw is that the values in the nested dict are the total number of times a word appears in the text, not how many times it actually follows that particular word.

from collections import Counter, defaultdict

wordsFile = f.read()
words = [x.strip(string.punctuation).lower() for x in wordsFile.split()]    
counter = Counter(words)

# the dict of [unique word]:[index of appearance in 'words']
index = defaultdict(list) 

# Appends every position of 'term' to the 'term' key
for pos, term in enumerate(words):
    index[term].append(pos)  

# range ends at len(index) - 2 because last word in text has no follower
master = {}
for i in range(0,(len(index)-2)):

    # z will hold the [index of appearance in 'words'] values
    z = []
    z = index.values()[i] 
    try:

        # Because I am interested in follower words
        z = [words[a+1] for a in z]
        print z; print

        # To avoid value errors if a+1 exceeds range of list
    except Exception:
        pass

    # For each word, build r into the dict that contains each follower word and its frequency.

    r = {}
    for key in z:
        r.update({key: counter[key]})

    master.update({index.keys()[i]:r})


return  master

Do you have working code that does this? If so your question may not be suitable for SO, but another site. — SuperBiasedMan, Jan 12 '16 at 16:34
If possible for your use case, Consider using a simpler data structure like `{("the", "cat"): 1, ("the", "dog"): 1, ("chased", "the"): 1}`. Then all you need to do is get a [list of consecutive word pairs](http://stackoverflow.com/questions/5764782/iterate-through-pairs-of-items-in-a-python-list), and pass it to `Counter`. — Kevin, Jan 12 '16 at 16:43
I don't have a choice regarding data structure. Obviously I am a total newb at Python, sorry for the wall of code. I really want to be a good member of the SO community and I greatly appreciate any help given! — Jesse Downing, Jan 12 '16 at 17:01
Thanks for asking the question! After having to research the `defaultdict` class to understand how it works and how to use it, putting it to working in future programs will probably be far more likely. There is a certain elegance that comes with its usage. — Noctis Skytower, Jan 13 '16 at 22:02

Noctis Skytower · Answer 1 · 2016-01-13T22:07:59.587

It is not necessary to use the collections module to implement a working solution:

Example 1

import itertools
import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = {}
    add_frequency(frequency, array)
    pprint.pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        if a in frequency:
            follower = frequency[a]
        else:
            follower = frequency[a] = {}
        if b in follower:
            follower[b] += 1
        else:
            follower[b] = 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], [s2]), (s[2], s[3]), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()

The following code shows how you can use collections.defaultdict to do as you asked:

Example 2

import collections
import itertools
import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = collections.defaultdict(lambda: collections.defaultdict(int))
    add_frequency(frequency, array)
    pprint.pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        frequency[a][b] += 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], [s2]), (s[2], s[3]), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()

You may also use functools.partial instead of a lambda when creating the defaultdict.

Example 3

from collections import defaultdict
from functools import partial
from itertools import tee
from pprint import pprint


def main():
    array = 'the', 'cat', 'chased', 'the', 'dog', 'fled'
    frequency = defaultdict(partial(defaultdict, int))
    add_frequency(frequency, array)
    pprint(frequency)


def add_frequency(frequency, array):
    for a, b in pairwise(array):
        frequency[a][b] += 1


def pairwise(iterable):
    """s -> (s[0], s[1]), (s[1], [s2]), (s[2], s[3]), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

if __name__ == '__main__':
    main()

score 2 · Answer 2 · answered Jan 12 '16 at 17:19

I have a simple answer, though it does not use defaultdict - just standard dictionary and setdefault. I may have missed your intent, but this is what I am seeing:

def word_analysis(input):
    from itertools import tee, izip
    i1, i2 = tee(input)
    i2.next()
    results = {}
    for w1,w2 in izip(i1,i2):           # Process works pairwise
        d = results.setdefault(w1,{})   # Establish/use the first word dict
        d[w2] = 1 + d.setdefault(w2,0)  # Increment the counter
    return results

print word_analysis(['the', 'cat', 'chased', 'the', 'dog', 'fled'])

For me, this offers the same output you reported:

{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'dog': {'fled': 1}, 'cat': {'chased': 1}}

Am I missing something?

upvoted. I prefer this solution since it avoids using indexes. And it's a great way to teach how to use dictionaries and iterators. Kudos. — Pynchia, Jan 12 '16 at 17:46

SoreDakeNoKoto · Accepted Answer · 2016-01-12T17:24:29.337

1

Using defaultdict:

import collections

words = ['the', 'cat','chased', 'the', 'dog', 'fled']
result = collections.defaultdict(dict)

for i in range(len(words) - 1):   # loop till second to last word
    occurs = result[words[i]]    # get the dict containing the words that follow and their freqs
    new_freq = occurs.get(words[i+1], 0) + 1  # update the freqs
    occurs[words[i+1]] = new_freq

print list(result.items())

edited Jan 12 '16 at 17:24

answered Jan 12 '16 at 17:19

SoreDakeNoKoto

1,175
1
9
16

I greatly appreciate it gang. I am in awe of how quickly yall eviscerated that problem I was grinding against for hours! I'm working hard to be at that level on day... – Jesse Downing Jan 12 '16 at 17:38

Defaultdict(defaultdict) for text analysis

3 Answers3