Given a text, count occurrences of every pair of consecutive words

Question

Input:

Once upon a time a time this upon a

Output:

dictionary {
    'Once upon': 1,
    'upon a': 2,
    'a time': 2,
    'time a': 1,
    'time this': 1,
    'this upon': 1
}

CODE:

def countTuples(path):
    dic = dict()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic

I am getting this error:

File "C:/Users/user/Anaconda3/hw2.py", line 100, in countTuples
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1
TypeError: list indices must be integers or slices, not str

If I replace the += 1 with = 1, everything works fine. I guess the problem is that += tries to read the value of an entry that doesn't exist yet?

What can I do to fix this?

A counter will iterate for every tuple over the file. I can't afford that, it will be n^2 in time complexity and I want to avoid that. @WillemVanOnsem — Tony Tannous, Apr 14 '17 at 12:36

pansen · Accepted Answer · 2017-04-14T13:16:13.627

You can use a defaultdict to make your solution work. With a defaultdict, you specify a factory for the default value of any missing key (here int, which yields 0). This allows you to apply += 1 to a key that has not been created yet:

import codecs
from collections import defaultdict

def countTuples(path):
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic

>>> {'Once upon': 1,
     'a time': 2,
     'this upon': 1,
     'time a': 1,
     'time this': 1,
     'upon a': 2}
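
To illustrate why += 1 fails on a plain dict but works on a defaultdict, here is a minimal sketch. The names plain, counts and missing_key are illustrative only and do not come from the question.

```python
from collections import defaultdict

plain = {}
try:
    # += first reads plain['upon a'], which does not exist yet
    plain['upon a'] += 1
except KeyError as exc:
    missing_key = exc.args[0]

# int() returns 0, so every missing key starts at 0
counts = defaultdict(int)
counts['upon a'] += 1   # 0 + 1
counts['upon a'] += 1   # 1 + 1

print(missing_key)      # the key that the plain dict could not find
print(dict(counts))
```

A plain assignment (= 1) never reads the old value, which is why it does not raise but also never counts past 1.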

score 2 · Answer 2 · answered Apr 14 '17 at 12:43

One solution that requires minimal changing of your code is to just use a defaultdict:

from collections import defaultdict

line = 'Once upon a time a time this upon a'

dic = defaultdict(int)

s = line.split()

for i in range(0, len(s)-1):
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1

This produces:

dic

defaultdict(int,
            {'Once upon': 1,
             'a time': 2,
             'this upon': 1,
             'time a': 1,
             'time this': 1,
             'upon a': 2})

Your function then just becomes:

import codecs

def countTuples(path):
    dic = defaultdict(int)  # missing keys start at 0
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range(0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic
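
As a quick end-to-end check, the function can be run against a small file. This is only a sketch: the temporary file and the names sample_path and result are assumptions made for the demonstration, not part of the question.

```python
import codecs
import os
import tempfile
from collections import defaultdict

def countTuples(path):
    dic = defaultdict(int)  # missing keys start at 0
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range(len(s) - 1):
                dic[s[i] + ' ' + s[i + 1]] += 1
    return dic

# Write the sample sentence to a temporary file (hypothetical fixture).
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
    tmp.write('Once upon a time a time this upon a\n')

result = dict(countTuples(sample_path))
os.remove(sample_path)
print(result)
```

Note that pairs are counted per line, so the last word of one line and the first word of the next are never paired.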

Willem Van Onsem · Answer 3 · 2017-04-14T13:01:14.927

2

No need to make it that hard: simply use a Counter, and use zip to feed the bigrams to it, like:

import codecs
from collections import Counter

def countTuples(path):
    dic = Counter()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            # zip(s, s[1:]) pairs each word with its successor
            dic.update('%s %s' % t for t in zip(s, s[1:]))
    return dic
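
To make the zip step concrete, here is a small sketch on the sample sentence from the question. The names pairs and bigrams are illustrative.

```python
from collections import Counter

line = 'Once upon a time a time this upon a'
s = line.split()

# zip stops at the shorter input, so this yields len(s) - 1 adjacent pairs
pairs = list(zip(s, s[1:]))

# Counter consumes the generator once, incrementing one key per pair
bigrams = Counter('%s %s' % t for t in pairs)

print(pairs[:3])
print(bigrams)
```

Each pair is visited exactly once, so as far as I can tell the work grows linearly with the number of words, not quadratically.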

edited Apr 14 '17 at 13:01

answered Apr 14 '17 at 12:53


score 0 · Answer 4 · answered Aug 01 '22 at 16:13

The best answer I've found so far is from the question "Word frequency count based on two words using python":

from collections import Counter
# text must be a list of words (e.g. line.split()), not a raw string
two_words = [' '.join(ws) for ws in zip(text, text[1:])]
dic = {w:f for w, f in Counter(two_words).most_common() if f>1}

As written, this keeps only pairs that occur more than once. Change the if f>1 condition to keep pairs above a different threshold, or drop it to keep every pair.
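
A self-contained sketch of the threshold filter, run on the sample sentence from the question. The names threshold and frequent are hypothetical, and text is assumed to be a list of words.

```python
from collections import Counter

# text is assumed to be a list of words, not a raw string
text = 'Once upon a time a time this upon a'.split()
threshold = 1  # hypothetical cutoff: keep pairs seen more than once

two_words = [' '.join(ws) for ws in zip(text, text[1:])]
frequent = {w: f for w, f in Counter(two_words).most_common() if f > threshold}

print(frequent)
```

With this input, only 'upon a' and 'a time' pass the filter, each with a count of 2.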
