Printing a unigram count in Python

Question

I have a text file named corpus.txt containing the following 4 lines of text:

 peter piper picked a peck of pickled peppers 
 a peck of pickled peppers peter piper picked 
 if peter piper picked a peck of pickled peppers 
 where s the peck of pickled peppers peter piper picked

I want the program to print each word together with the number of times it occurs, for example:

4 peter
4 piper

etc.

This is the code that I have written

f = open("corpus.txt","r")
w, h = 100, 100;
k=1
a=0
uwordcount=[]
for i in range(100):
       uwordcount.append(0)
uword = [[0 for x in range(w)] for y in range(h)]
l = [[0 for x in range(w)] for y in range(h)] 
l[1] = f.readline()
l[2] = f.readline()
l[3] = f.readline()
l[4] = f.readline()
lwords = [[0 for x in range(w)] for y in range(h)] 
lwords[1]=l[1].split()
lwords[2]=l[2].split()
lwords[3]=l[3].split()
lwords[4]=l[4].split()
for i in [1,2,3,4]:
    for j in range(len(lwords[i])):
        uword[k]=lwords[i][j]
        uwordcount[k]=0
        for x in [1,2,3,4]:
            for y in range(len(lwords[i])):
                if uword[k] == lwords[x][y]:
                    uwordcount[k]=uwordcount[k]+1
        for z in range(k):
            if uword[k]==uword[z]:
                a=1

        if a==0:
            print(uwordcount[k],' ',uword[k])
            k=k+1

I am getting the error:

Traceback (most recent call last):
  File "F:\New folder\1.py", line 25, in <module>
    if uword[k] == lwords[x][y]:
IndexError: list index out of range

Can anyone tell me what the problem is here?

Either `k`, `x` or `y` is not a valid index into `uword`, `lwords` or `lwords[x]`, respectively. You should not blindly access those values; either test that something is there, or change the logic. — Adelin, Feb 11 '19 at 10:09
You have way too many loops and lists there. Just create _one_ list of words, then iterate over the lines in the file and the words in each line. Or just use `collections.Counter(f.read().split())` if you are in a hurry. — tobias_k, Feb 11 '19 at 10:11

Patrick Artner · Answer 1 · 2019-02-11T10:38:55.307

IndexError: list index out of range means that one of your indexes points to a position outside its list. You would need to debug your code to find where that happens.
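To make the error concrete: the innermost loop in the question runs `y` over `range(len(lwords[i]))` but then indexes `lwords[x][y]`, so the index runs off the end whenever line `i` has more words than line `x`. Below is a minimal sketch that finds those cases for the four corpus lines. It uses 0-based lists rather than the question's 1-based ones, and the names `lines` and `failures` are only for illustration.

```python
lines = [
    "peter piper picked a peck of pickled peppers",
    "a peck of pickled peppers peter piper picked",
    "if peter piper picked a peck of pickled peppers",
    "where s the peck of pickled peppers peter piper picked",
]
lwords = [line.split() for line in lines]

# Collect every (i, x) pair where looping y over len(lwords[i]) would
# index past the end of lwords[x], as the question's inner loop does.
failures = [
    (i, x)
    for i in range(len(lwords))
    for x in range(len(lwords))
    if len(lwords[i]) > len(lwords[x])
]

print([len(words) for words in lwords])  # number of words per line
print(failures[0])  # first (i, x) pair that overflows, 0-based
```

The first failing pair corresponds to the third line (9 words) being compared against the first line (8 words). Changing the inner loop to `range(len(lwords[x]))` should avoid this IndexError, although as far as I can tell the flag `a` is also never reset to 0, so the printed counts would still be incomplete.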

Use collections.Counter to ease this task:

# with open('corpus.txt', 'r') as r: text = r.read()

text = """peter piper picked a peck of pickled peppers 
 a peck of pickled peppers peter piper picked 
 if peter piper picked a peck of pickled peppers 
 where s the peck of pickled peppers peter piper picked """

from collections import Counter

# split the text in lines, then each line into words and count those:
c = Counter( (x for y in text.strip().split("\n") for x in y.split()) )

# format the output
print(*(f"{cnt} {wrd}" for wrd,cnt in c.most_common()), sep="\n")

Output:

4 peter
4 piper
4 picked
4 peck
4 of
4 pickled
4 peppers
3 a
1 if
1 where
1 s
1 the

tobias_k · Accepted Answer · 2019-02-11T11:13:16.577

You have way too many different lists here. Also, don't rely on all those magic numbers for the number of lines, the maximum number of words or entries per list, etc. Instead of having one list for the words in each line, just use a single collection for all the words. And instead of a second list for the counts, use a dictionary that holds both the unique words and their counts:

with open("corpus.txt") as f:
    counts = {}
    for line in f:
        for word in line.split():
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

Afterwards, counts looks like this: {'peter': 4, 'piper': 4, 'picked': 4, 'a': 3, 'peck': 4, 'of': 4, 'pickled': 4, 'peppers': 4, 'if': 1, 'where': 1, 's': 1, 'the': 1}

To retrieve the words and their counts, you can again use a loop:

for word in counts:
    print(word, counts[word])

Of course, you can do the same in fewer lines of code using collections.Counter, but I think doing it manually will help you learn more about Python.
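For comparison, here is a sketch of what the shorter collections.Counter version could look like. To keep it runnable on its own, an in-memory `io.StringIO` stands in for `corpus.txt` (that stand-in is my assumption); with the real file you would use `open("corpus.txt")` instead.

```python
import io
from collections import Counter

# Stand-in for open("corpus.txt"), so the sketch runs without the file.
corpus = io.StringIO(
    "peter piper picked a peck of pickled peppers\n"
    "a peck of pickled peppers peter piper picked\n"
    "if peter piper picked a peck of pickled peppers\n"
    "where s the peck of pickled peppers peter piper picked\n"
)

# Counter accepts any iterable of hashable items: here, every word of every line.
counts = Counter(word for line in corpus for word in line.split())

# Print in the "count word" format asked for in the question.
for word, count in counts.items():
    print(count, word)
```

A Counter is a dict subclass, so the loop over the items works the same way as with the hand-written dictionary.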

To be honest, I don't understand half of what the code below for i in [1,2,3,4]: is supposed to do. It seems like maybe you want to create a kind of co-occurrence matrix for the words? In this case, too, I would suggest a (nested) dictionary, which makes it much easier to store and retrieve entries.

with open("corpus.txt") as f:
    matrix = {}
    for line in f:
        for word1 in line.split():
            if word1 not in matrix:
                matrix[word1] = {}
            for word2 in line.split():
                if word2 != word1:
                    if word2 not in matrix[word1]:
                        matrix[word1][word2] = 1
                    else:
                        matrix[word1][word2] += 1

The code is almost the same as before, but with another nested loop over the other words on the same line. For example, the entry for "peter" would be {'piper': 4, 'picked': 4, 'a': 3, 'peck': 4, 'of': 4, 'pickled': 4, 'peppers': 4, 'if': 1, 'where': 1, 's': 1, 'the': 1}
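As a possible variation on the nested dictionary, the "not in" checks can be dropped by combining `collections.defaultdict` with `collections.Counter`. This is only a sketch of an alternative, not a change to the approach; the corpus lines are inlined as a list (an assumption, so it runs without the file).

```python
from collections import Counter, defaultdict

lines = [
    "peter piper picked a peck of pickled peppers",
    "a peck of pickled peppers peter piper picked",
    "if peter piper picked a peck of pickled peppers",
    "where s the peck of pickled peppers peter piper picked",
]

# matrix[word1][word2] = number of times word2 appears on a line with word1
matrix = defaultdict(Counter)
for line in lines:
    words = line.split()
    for word1 in words:
        for word2 in words:
            if word2 != word1:
                matrix[word1][word2] += 1

print(matrix["peter"]["piper"])
print(matrix["peter"]["a"])
```

A Counter returns 0 for a missing key instead of raising KeyError, so looking up a pair of words that never share a line, such as matrix["if"]["where"], simply gives 0.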

score 0 · Answer 3 · answered Feb 11 '19 at 10:25

Honestly, I don't get your code, because it has far more loops and logic than necessary (I guess). So I am doing it in my own way.

import pprint

with open('corpus.txt', 'r') as cr:
    dic = {}  # Empty dictionary mapping each word to its count

    for line in cr:
        for word in line.split():  # Split the line into words
            if word in dic:   # If the word is already a key, add 1 to its value
                dic[word] += 1
            else:
                dic[word] = 1   # If the word is not present yet, start its count at 1

pprint.pprint(dic)  # pprint is a standard-library module for pretty-printing data structures

If you are in a real hurry, use collections.Counter instead.

score 0 · Answer 4 · answered Feb 11 '19 at 11:01

Using a dictionary (here a defaultdict), you can do it like this:

from collections import defaultdict
dic = defaultdict(int)
with open('corpus.txt') as file:
    for i in file.readlines():
        for j in i.split():
            dic[j] +=1


for k,v in dic.items():
    print(v,k, sep='\t')

'''    
output

4       peter
4       piper
4       picked
3       a
4       peck
4       of
4       pickled
4       peppers
1       if
1       where
1       s
1       the

'''
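The loop above prints the words in the order they were first seen. If the most frequent words should come first, as in the question's example, one option is to sort the items by count before printing. A sketch, with the corpus lines inlined as a list (an assumption, so it runs without the file) and `ranked` as an illustrative name:

```python
from collections import defaultdict

lines = [
    "peter piper picked a peck of pickled peppers",
    "a peck of pickled peppers peter piper picked",
    "if peter piper picked a peck of pickled peppers",
    "where s the peck of pickled peppers peter piper picked",
]

dic = defaultdict(int)
for line in lines:
    for word in line.split():
        dic[word] += 1

# Sort by count, highest first. sorted() is stable, also with reverse=True,
# so words with equal counts stay in first-seen order.
ranked = sorted(dic.items(), key=lambda item: item[1], reverse=True)

for word, count in ranked:
    print(count, word, sep="\t")
```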
