The output is unsorted, and sorting on the second column does not seem to be possible. Is there a special method to sort on the second value?

This program takes a text and counts how many times each word occurs in it:

import string
with open("romeo.txt") as file:  # opens the file with text
    lst = []
    d = dict ()
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  #removes the punctuations
            if word not in d:
                d[word] =1
            else:
                d[word] = d[word] +1

            if word not in lst:
                lst.append(word)    # append only this unique word to the list
                uniquewords.write(str(word) + '\n') # write the unique word to the file
print(d)

3 Answers

Dictionaries with default value

The code snippet:

d = dict()
...
if word not in d:
    d[word] =1
else:
    d[word] = d[word] +1

has become so common in Python that a subclass of dict has been created to get rid of it. It is called defaultdict and can be found in the collections module.

Thus we can simplify your code snippet to:

from collections import defaultdict

d = defaultdict(int)
...
d[word] = d[word] + 1

No need for this manual if/else test; if word is not in the defaultdict, it will be added automatically with initial value 0.
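A minimal, self-contained sketch of the same idea, using a short made-up sentence in place of romeo.txt (the sample text is only an assumption for illustration):

```python
from collections import defaultdict

# Hypothetical sample text, standing in for the contents of romeo.txt.
text = "the cat and the hat and the bat"

d = defaultdict(int)  # a missing key is created with the value int() == 0
for word in text.split():
    d[word.upper()] += 1  # no if/else needed, even for unseen words

print(dict(d))
```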

Counters

Counting occurrences is also something that is frequently useful; so much so that there exists a subclass of dict called Counter in module collections. It will do all the hard work for you.

from collections import Counter
import string

with open('romeo.txt') as input_file:
    counts = Counter(word.translate(str.maketrans('', '', string.punctuation)).upper() for line in input_file for word in line.split())

with open('romeo_unique.txt', 'w') as output_file:
    for word in counts:
        output_file.write(word + '\n')

As far as I can tell from the documentation, Counters are not guaranteed to be ordered by number of occurrences by default; however:

  • when I use them in the interactive Python interpreter, they are always printed in decreasing number of occurrences;
  • they provide a method .most_common(), which is guaranteed to return the entries in decreasing number of occurrences.
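To get the words ordered by count, .most_common() can be used directly. The following is a small sketch on two made-up lines rather than the real romeo.txt; the sample lines and the name strip_punctuation are assumptions for illustration. Note that iterating over a Counter with a plain for loop yields the words in the order they were first seen, so .most_common() is the safer choice whenever the order matters.

```python
from collections import Counter
import string

# Hypothetical sample lines, standing in for the lines of romeo.txt.
lines = [
    "Romeo, Romeo! wherefore art thou Romeo?",
    "Deny thy father and refuse thy name;",
]

# Build the translation table once instead of once per word.
strip_punctuation = str.maketrans('', '', string.punctuation)

counts = Counter(
    word.translate(strip_punctuation).upper()
    for line in lines
    for word in line.split()
)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = counts.most_common(2)

# Without an argument, all pairs are returned, highest count first.
for word, n in counts.most_common():
    print(word, n)
```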
Stef
  • This is great! But would it be possible to order by the number of counts? –  Oct 27 '20 at 12:50
  • Yes. See [`Counter.most_common`](https://docs.python.org/3.3/library/collections.html#collections.Counter.most_common). Incidentally, when I try it in the python repl, the Counter object appears to already be sorted like that. But the doc doesn't state that it should be. – Stef Oct 27 '20 at 12:55
  • Thank you very much. I am new to programming, but I will try to implement this in my code. That will take some time, but it looks like it is possible to order/sort my output based on the occurrences. –  Oct 27 '20 at 13:25
  • I have tested it. I might have made a mistake, but the output is not correct: `with open('romeo.txt') as input_file: counts = Counter(word.translate(str.maketrans('', '', string.punctuation)).upper() for line in input_file for word in line.split())` `with open('romeo_unique3.txt', 'w') as output_file: for word in counts: output_file.write(word + '\n')` `test=counts.most_common()` `print (test)` –  Oct 27 '20 at 15:24
  • @HJCur What is not correct about the output? I tested it on a file containing ten lines from Romeo&Juliet and got `[('NOR', 5), ('NAME', 5), ('A', 4), ('THAT', 4), ('OTHER', 3), ('WHICH', 3), ('ROMEO', 3), ('NOT', 2), ('MONTAGUE', 2), ('WHAT’S', 2), ('IS', 2), ('ANY', 2), ('PART', 2), ('WOULD', 2), ('HE', 2), ('THOU', 1), ('ART', 1), ('THYSELF', 1), ('THOUGH', 1), ('IT', 1), ('HAND', 1), ('FOOT', 1), ('ARM', 1), ('FACE', 1), ('BELONGING', 1), ...]` which looks correct to me. – Stef Oct 27 '20 at 15:28
  • I have run the original Romeo and Juliet text, and Romeo should be counted 156 times, but it is not shown at all. Maybe something goes wrong if you run the whole file. Or maybe I am doing something wrong... –  Oct 27 '20 at 16:35
  • `[('AND', 713), ('THE', 680), ('I', 585), ('TO', 541), ('A', 468), ('OF', 401), ('MY', 360), ('THAT', 347), ('IS', 344), ('IN', 319), ('YOU', 291), ('THOU', 277), ('ME', 265), ('NOT', 260), ('WITH', 255), ('IT', 228), ('THIS', 226), ('FOR', 224), ('BE', 213), ('BUT', 183), ('WHAT', 165), ('THY', 164), ('ROM', 163), ('HER', 156), ('AS', 155), ('O', 154), ('NURSE', 150), ('WILL', 149), ('SO', 147), ...` etc. Where is Romeo? –  Oct 27 '20 at 16:41
  • `print(sorted(uniqueness.items(), key=itemgetter(1), reverse=True))` `print(sorted(first_occurence.items(), key=itemgetter(1), reverse=True))` –  Oct 27 '20 at 16:42
  • If I run the program on the whole text and open the same file in a text editor, I can check the outcome with Ctrl+F, which shows how many times a word appears. Some words have a different count there than in Python. –  Oct 27 '20 at 16:50
  • @HJCur Could that be related to the way you handle punctuation? – Stef Oct 27 '20 at 22:49
  • @HJCur I notice you have `('ROM', 163)` and `('O', 154)` in there. Could it be that "Roméo" with an accent on the e is handled incorrectly? – Stef Oct 27 '20 at 22:50
  • No, "Rom." is OK; it appears in the original romeo.txt. What I found was that Romeo has 137 counts and Romeos (= Romeo's) has 19 counts. So 137 + 19 = 156, and in Notepad Romeo is counted 156 times. So the program works as expected :-) –  Oct 27 '20 at 23:25
  • The punctuation was indeed what made it a little complex, because I wanted to find unique words. Maybe there is a better way to remove the punctuation, but it will be difficult to build exactly the same "Ctrl+F" algorithm as in Notepad. –  Oct 27 '20 at 23:34
  • If you want you can find the text of old books online http://www.gutenberg.org/ebooks/1112 This is where I found the text from Romeo and Juliet. –  Oct 27 '20 at 23:37

In Python, standard dictionaries are an unsorted data type, but you can look here, assuming that by sorting your output you mean sorting d.
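One common way to sort d by count is sketched below, assuming Python 3.7 or later, where a dict keeps its insertion order. The sample counts are a small excerpt and the name sorted_d is only illustrative.

```python
# Sample counts, in the same shape as the dictionary d from the question.
d = {'THE': 680, 'TRAGEDY': 1, 'OF': 401, 'ROMEO': 137, 'AND': 713}

# Sort the (word, count) pairs by count, highest first, then rebuild a dict.
# The rebuilt dict keeps that order because dicts preserve insertion order.
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))

print(sorted_d)
```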

JLeno46
  • Yes I mean (d). I would like to sort (d) based on the number of occurrences. From the highest to the lowest. So if I understand correctly sorting is not possible because it is a dictionary. –  Oct 27 '20 at 12:19
  • It is possible since Python 3.6/3.7. See the accepted answer to the question linked in this answer. – Stef Oct 27 '20 at 12:57

A couple of remarks first:

  • You are not sorting explicitly (e.g. by using sorted) by a given property. A dictionary does not sort itself: since Python 3.7 it iterates in insertion order, so any ordering you see when printing it is incidental. It is better to sort a dict explicitly.
  • You check the existence of a word in the lst variable, which is very slow since checking a list requires checking all entries until something is found (or not). It would be much better to check for existence in a dict.
  • I'm assuming by "the second column" you mean the information for each word that counts the order in which the word first appeared.
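The second remark can be sketched with a set, which, like a dict, offers fast membership tests. The word list and variable names below are made up for illustration.

```python
# Hypothetical word stream, standing in for the words read from romeo.txt.
words = ["ROMEO", "JULIET", "ROMEO", "NURSE", "JULIET", "ROMEO"]

seen = set()          # membership test takes constant time on average
unique_in_order = []  # keeps the order of first appearance

for word in words:
    if word not in seen:  # a list would be scanned entry by entry here
        seen.add(word)
        unique_in_order.append(word)

print(unique_in_order)
```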

With that, I'd change the code to also record the word index of the first occurrence of each word, which then allows for sorting on exactly that.

Edit: Fixed the code. The sorting yielded by sorted sorts by key, not value. That's what I get for not testing code before posting an answer.

import string
from operator import itemgetter

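# open the input text and the output file together so both are closed automatically
with open("romeo.txt") as file, open('romeo_unique.txt', 'w') as uniquewords:
    first_occurence = {}  # word -> index of its first appearance
    uniqueness = {}       # word -> number of occurrences
    word_index = 1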

    for line in file:
        words = line.split()

        for word in words: # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper() #removes the punctuations

            if word not in uniqueness:
                uniqueness[word] = 1
            else:
                uniqueness[word] += 1

            if word not in first_occurence:
                first_occurence[word] = word_index
                uniquewords.write(str(word) + '\n') # write the unique word to the file

            word_index += 1

    print(sorted(uniqueness.items(), key=itemgetter(1)))
    print(sorted(first_occurence.items(), key=itemgetter(1)))
Etienne Ott
  • Thank you very much for your help. What I hope to achieve is not the alphanumeric order but indeed an order based on "the second column", which is the word count. So the word "THE", which is counted 690 times, must be shown first. But I am already very happy with your help! –  Oct 27 '20 at 12:36
  • @HJCur I just posted a correction to my modifications to the code. I forgot that dicts are sorted by key, not value, as default behaviour. The argument in the call to `sorted` fixes that. Please note the additional import. – Etienne Ott Oct 27 '20 at 12:39
  • This program reads the text of Romeo and Juliet. I am trying to make a ranking of all unique words, showing which word has been used the most. This is a piece of the output: `{'1595': 1, 'THE': 680, 'TRAGEDY': 1, 'OF': 401, 'ROMEO': 137, 'AND': 713, 'JULIET': 59, ...` So AND is counted 713 times, and I expect `{AND:713, THE:680, OF:401, ROMEO:137, ...` etc. –  Oct 27 '20 at 13:03
  • The text of Romeo and Juliet can be copied from https://www.gutenberg.org/files/1112/1112.txt in case you want to see the output of this program. I have also been wondering whether I should save the output as a JSON file, but I am just starting to learn Python, so I do not know if that is possible with some workaround. –  Oct 27 '20 at 13:15
  • @HJCur That is what my modifications are supposed to do, albeit in reverse order. A descending order can be achieved by adding the argument `reverse=True`, as in `sorted(uniqueness.items(), key=itemgetter(1), reverse=True)`. Anyway, the answer by Stef seems better suited to your needs. – Etienne Ott Oct 27 '20 at 13:30
  • Your solution works perfectly. I tested it! Thank you very much! `print(sorted(uniqueness.items(), key=itemgetter(1), reverse=True))` did the job :-) –  Oct 27 '20 at 14:08
  • Now this program can be used for any text. And it will show how many times each unique word appears in text. I would like to thank you all for helping me. –  Oct 27 '20 at 14:14
  • I will also test if all the generated output is correct. But this will take some time. The first result looked ok... –  Oct 27 '20 at 14:23
  • Etienne, your solution also works fine! It gives exactly the same output as the solution from Stef. I would like to thank you both for your help. The difference had to do with "Romeo" and "Romeo's", but now I understand the confusion :-) –  Oct 27 '20 at 20:08