4


Input:

Once upon a time a time this upon a


Output:

dictionary {
    'Once upon': 1,
       'upon a': 2,
       'a time': 2,
       'time a': 1,
    'time this': 1,
    'this upon': 1
}


CODE:

def countTuples(path):
    dic = dict()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic

I am getting this error:

File "C:/Users/user/Anaconda3/hw2.py", line 100, in countTuples
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1
TypeError: list indices must be integers or slices, not str

If I replace the += with a plain = 1, everything works fine, so I guess the problem occurs when I try to read the value of an entry that doesn't exist yet?

What can I do to fix this?

Willem Van Onsem
Tony Tannous

4 Answers

4

You can use a defaultdict to make your solution work. A defaultdict takes a factory for the default value (here int, which produces 0), so an assignment like += 1 works on a key that has not been explicitly created yet:

import codecs
from collections import defaultdict

def countTuples(path):
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic

>>> {'Once upon': 1,
     'a time': 2,
     'this upon': 1,
     'time a': 1,
     'time this': 1,
     'upon a': 2}
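
To illustrate the difference, here is a minimal sketch on the sample sentence from the question. It works on an in-memory string rather than a file, and the helper name count_bigrams is hypothetical, not part of the original code. It shows that += 1 on a missing key of a plain dict raises a KeyError, while defaultdict(int) starts missing keys at 0:

```python
from collections import defaultdict

def count_bigrams(line):
    """Count adjacent word pairs in one line (hypothetical helper for illustration)."""
    counts = defaultdict(int)  # a missing key is created with int() == 0
    words = line.split()
    for first, second in zip(words, words[1:]):
        counts[first + ' ' + second] += 1
    return counts

# With a plain dict, += must read the old value first, and that lookup fails.
plain = {}
missing_key = None
try:
    plain['Once upon'] += 1
except KeyError as exc:
    missing_key = exc.args[0]

result = count_bigrams('Once upon a time a time this upon a')
```

If you would rather keep a plain dict, dic[key] = dic.get(key, 0) + 1 has the same effect without the import.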
pansen
2

One solution that requires minimal changes to your code is to use a defaultdict:

from collections import defaultdict

line = 'Once upon a time a time this upon a'

dic = defaultdict(int)

s = line.split()

for i in range(0, len(s)-1):
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1

This produces:

dic

defaultdict(int,
            {'Once upon': 1,
             'a time': 2,
             'this upon': 1,
             'time a': 1,
             'time this': 1,
             'upon a': 2})

Your function then just becomes:

def countTuples(path):
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic
Kewl
2

There is no need to make it that hard. Simply use a Counter, and use zip to feed the bigrams to it:

import codecs
from collections import Counter

def countTuples(path):
    dic = Counter()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            # zip(s, s[1:]) pairs each word with the word that follows it
            dic.update('%s %s' % t for t in zip(s, s[1:]))
    return dic
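
As a rough illustration of what zip produces here, the sketch below runs the same idea on the sample sentence from the question, held in memory instead of read from a file. The variable names are chosen for the example only:

```python
from collections import Counter

words = 'Once upon a time a time this upon a'.split()

# zip stops at the shorter sequence, so this yields len(words) - 1 pairs.
pairs = list(zip(words, words[1:]))

counts = Counter()
# update() adds one to the count of every item the generator yields.
counts.update('%s %s' % pair for pair in pairs)
```

A Counter also returns 0 for keys it has never seen, so looking up a pair that does not occur raises no error.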
Willem Van Onsem
0

The best answer I've found so far is from the question "Word frequency count based on two words using python":

from collections import Counter
# text is expected to be a list of words, e.g. the result of line.split()
two_words = [' '.join(ws) for ws in zip(text, text[1:])]
dic = {w:f for w, f in Counter(two_words).most_common() if f>1}

The if f>1 condition keeps only the word pairs that occur more than once; change it to apply a different threshold.
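
A small sketch of the thresholded version, assuming the input is the sample sentence from the question split into words. The name min_count is a hypothetical parameter introduced for this example:

```python
from collections import Counter

# text must be a list of words; zipping a raw string would pair characters.
text = 'Once upon a time a time this upon a'.split()
min_count = 2  # hypothetical threshold: keep pairs seen at least this often

two_words = [' '.join(ws) for ws in zip(text, text[1:])]
frequent = {w: f for w, f in Counter(two_words).most_common() if f >= min_count}
```

With min_count set to 1, every pair is kept and the result matches the output asked for in the question.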

user3503711