My code reads a poem from a file, counts the number of times each word occurs, and stores the counts in a dictionary. However, it treats repeated words as separate entries and counts them separately.

Here is my code:

def unique_word_count(file):
    words = open(file, "r")
    lines = words.read()
    words = lines.split()
    counts = dict()

    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    return counts

For example, given the string "Hi, hi, hi how are you", my code produces:

{"Hi":1, "hi":1, "hi":1, "how":1, "are":1, "you":1}

While it should come out as:

{"hi":3, "how":1, "are":1, "you":1}

How can I fix my code so that it does not repeat words? Thank you!

Pranav Hosangadi
flory
  • Note that indenting only the first line doesn't format your code correctly. Use code blocks (indent ALL code by 4 spaces) or code fences (three backticks `\`\`\`` on the lines before and after your code) [Formatting help](/help/formatting) – Pranav Hosangadi Dec 14 '22 at 04:14
  • The output you claim to obtain is impossible -- Keys in a dict are unique, so it is not possible for a dict to have two `"hi"` keys. You probably got `{"Hi": 1, "hi":2, "how":1, "are":1, "you":1}`, but if you truly understand how your code works, combining those two `"hi"` values should be trivially easy using `str.lower` – Pranav Hosangadi Dec 14 '22 at 04:15
  • Does this answer your question? [How do I lowercase a string in Python?](https://stackoverflow.com/questions/6797984/how-do-i-lowercase-a-string-in-python) – Pranav Hosangadi Dec 14 '22 at 04:17
  • Hello, I understand combining the "Hi" and "hi", but my code actually does count the two "hi" separately. Not sure why. – flory Dec 14 '22 at 04:19
  • Look at your dictionary carefully. The first two keys contain a trailing comma, since `.split()` only splits by spaces. Looks like you need a better definition of what a "word" is. – Pranav Hosangadi Dec 14 '22 at 04:21
  • The string is just an example. The actual poem in the file is very long and I don't think I could post it entirely here, but I could show part of the input and output. Part of the poem: "Row, row, row your boat". The output of my code is {"Row": 10, "row": 10, "row": 10, "your": 10, "boat":10}. Note: the count is ten because the poem repeats the string multiple times. – flory Dec 14 '22 at 04:24
  • Again, that is not possible. Your dictionary can NOT have two `"row"` keys. It has one `"row"` key, and one `"row,"` key. *Notice the **trailing comma** in the second key*. Now you need to ask how you can remove trailing commas (or periods or other symbols) from a word. – Pranav Hosangadi Dec 14 '22 at 04:25
  • As an aside, you're currently not closing the file before the function ends, or before you repurpose the variable `words`. After the line `lines = words.read()` you need to add `words.close()` or change that part of the function to use the file context manager. – nigh_anxiety Dec 14 '22 at 04:38

1 Answer

You get that output because of `split`.

`split()` splits the sentence wherever there is whitespace, so it treats `hi,` and `hi` as different words. Try this code instead:

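import re

def word_count(text):
    counts = dict()
    # Lowercase first so "Hi" and "hi" match, then keep only runs of word characters
    words = re.findall(r'\w+', text.lower())
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    return counts

print(word_count('Hi, hi, hi how are you'))  # {'hi': 3, 'how': 1, 'are': 1, 'you': 1}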

Splitting on whitespace with `split` is the usual way to find words in a string, but it leaves punctuation attached to the words, so a regular expression is more reliable here.
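To make the difference concrete, here is a small comparison of the two approaches on the example string from the question. The variable names `split_words` and `regex_words` are only for illustration.

```python
import re

text = "Hi, hi, hi how are you"

# str.split() only breaks on whitespace, so punctuation stays attached.
split_words = text.lower().split()
print(split_words)  # ['hi,', 'hi,', 'hi', 'how', 'are', 'you']

# re.findall(r"\w+", ...) keeps only runs of word characters.
regex_words = re.findall(r"\w+", text.lower())
print(regex_words)  # ['hi', 'hi', 'hi', 'how', 'are', 'you']
```

With `split`, `'hi,'` and `'hi'` become two different dictionary keys, which is why the counts end up divided between them.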

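`\w` (lowercase w) matches a single "word" character: a letter, a digit, or an underscore. With the `re.ASCII` flag that is exactly `[a-zA-Z0-9_]`; by default in Python 3 it also matches letters and digits from other alphabets. `\w+` matches a run of one or more such characters.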

`re.findall` scans the string and returns a list of every match, so the punctuation between words is dropped and only the words themselves remain.
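Since the question reads the poem from a file, the same idea can be sketched as a file-based function. This is one possible version, not the only way to do it: it swaps the manual counting loop for `collections.Counter`, uses a `with` block so the file is closed (as suggested in the comments), and assumes the file is UTF-8 encoded.

```python
import re
from collections import Counter


def unique_word_count(path):
    """Return a dict mapping each lowercased word in the file to its count."""
    # The with-statement closes the file even if reading fails.
    # Assumption: the poem is stored as UTF-8 text.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Lowercase so "Row" and "row" are the same word, then drop punctuation.
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))
```

One caveat for poems: `\w` does not match apostrophes, so a word like `don't` is counted as `don` and `t`. If that matters, the pattern needs to be adjusted.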
