The paragraph is meant to contain extra spaces and random punctuation; I removed them in my for loop using .replace. Then I turned the paragraph into a list with .split() to get ['the', 'title', 'etc']. Then I wrote two functions: count_words, which counts how often a word occurs, and, because I didn't want to count the same word more than once, unique, which builds a list of unique words. However, I still need a for loop that prints each word and how many times it appears, with output something like this:

The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.

I also have a hard time understanding what a for loop essentially does. I read that we should use for loops for counting and while loops for everything else, but a while loop can also be used for counting.

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """


for r in ((",", ""), ("!", ""), (".", ""), ("  ", "")):
    paragraph = paragraph.replace(*r)

paragraph_list = paragraph.split()


def count_words(word, word_list):

    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result
unique_list = unique(paragraph_list)
Cream
  • I don't think you want to get rid of spaces if you plan on splitting the data. – Mad Physicist Oct 10 '18 at 06:12
  • the start and end have two spaces so ("  ", "") is just removing the two spaces. Yeah sorry about that I should've mentioned that. – Cream Oct 10 '18 at 06:15
  • You forgot to remove quotes and newlines – Mad Physicist Oct 10 '18 at 06:23
  • `set()` creates a set and throws it away. You need to assign it to a variable to use it. You may also return the set itself from `unique()`. No need to convert to list as you can enumerate and lookup elements in a set too. – Adrian W Oct 10 '18 at 06:23
  • `def unique(word): result = [] for f in word: if f not in result: result.append(f) return result` it seems that just works without set() – Cream Oct 10 '18 at 06:31
  • You get the **freq** using a `dict{word:count}` in **one** loop. Read about [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter) – stovfl Oct 10 '18 at 06:40

2 Answers

3

It is better to split with the re module and count with dict.get and a default value:

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

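import re

word_count = {}
# Split on spaces, commas, curly quotes, !, ?, periods and newlines.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    # get() returns 0 the first time a word is seen.
    word_count[w] = word_count.get(w, 0) + 1
# Adjacent separators produce empty strings; drop that entry if present.
word_count.pop('', None)

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))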

Output:

The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...

It is debatable what to do with Chuu’s. I decided not to split it, but you can add that later if you want.

Update:

The following line splits paragraph.lower() using a regular expression. The advantage is that a single pattern can describe several separators at once:

re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())
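As a small illustration of why the empty-string key has to be removed afterwards: when two separators sit next to each other, re.split returns an empty string between them. The sample string below is made up for this sketch, and the character class with + is only one possible alternative, not what the code above does.

```python
import re

sample = "a, b."  # made-up input for illustration

# Alternation splits at every single separator, so ", " yields an empty piece.
pieces = re.split(r" |,|\.", sample)
print(pieces)  # ['a', '', 'b', '']

# A character class with + treats a run of separators as one split point.
collapsed = re.split(r"[ ,.]+", sample)
print(collapsed)  # ['a', 'b', '']
```

Either way a trailing separator still leaves one empty string at the end, so filtering out empty pieces remains necessary.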

With respect to this line:

word_count[w] = word_count.get(w, 0) + 1

word_count is a dictionary. The advantage of get is that you can supply a default value (here 0) for the case where w is not in the dictionary yet. The line therefore increments the count for word w, starting from 0 the first time the word is seen.
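For comparison, here is a minimal sketch of the same counting done with collections.Counter, which was suggested in a comment on the question. The sample text and the variable names are made up for illustration; the point is only that Counter gives the same result as the get-based loop.

```python
import re
from collections import Counter

text = "The cat and the hat. The end!"  # made-up sample text

# Split on runs of separators and drop empty pieces.
words = [w for w in re.split(r"[ ,.!?\n]+", text.lower()) if w]

# Manual counting with dict.get and a default of 0.
manual = {}
for w in words:
    manual[w] = manual.get(w, 0) + 1

# Counter does the same bookkeeping in one call.
counted = Counter(words)

for word, n in counted.items():
    print("The word {} appears {} time(s) in the paragraph".format(word, n))
```

Counter is a dict subclass, so the printing loop works unchanged with either approach.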

Juan Leni
  • I wonder why I've got a negative vote. Feedback is appreciated, I am happy to improve the answer. – Juan Leni Oct 10 '18 at 07:55
  • Thank you, that was very informative. I didn't get to read about dictionaries in python. I didn't know you could easily do this function without functions and just using for loops. However, what is happening right here? `word_count[w] = word_count.get(w, 0) + 1` and `re.split(' |,|“|”|!|\?|\.|\n', paragraph.lower()` – Cream Oct 10 '18 at 18:30
0

Beware: your example text is simple, but punctuation rules can be complex, and writers do not always follow them. What if the text contains two adjacent spaces (incorrect, but frequent)? What if the writer is more used to French and puts spaces before and after a colon or semicolon?

I think the 's construct needs special processing. What about: """John has a bicycle. Mary says that her one is nicer than John's.""" IMHO the word John occurs twice here, while your algorithm will see one John and one Johns.

Additionally, as Unicode text is now common on web pages, you should be prepared to find non-ASCII equivalents of spaces and punctuation marks:

“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
  U+00A0 NO-BREAK SPACE
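If you are unsure which of these characters a text contains, the standard unicodedata module can name them. This is a minimal sketch; describe is a hypothetical helper name introduced only for this example.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))

# The five characters listed above, written as escapes to avoid ambiguity.
for ch in "\u201c\u201d\u2019\u2018\u00a0":
    print(describe(ch))
```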

In addition, according to this older question, the best way to remove punctuation is translate. The linked question used Python 2 syntax, but in Python 3 you can do:

import re
import string

paragraph = paragraph.strip()                   # remove initial and terminal white spaces
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # map non-ASCII quotes and spaces to ASCII
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)  # remove 's but keep the word itself
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
words = paragraph.split()
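To check the John example from above, the same steps can be wrapped in a function. This is a sketch under assumptions: clean_words is a hypothetical name, the sample sentence deliberately uses a curly apostrophe, and lowercasing is added here so that differently capitalised forms count as one word.

```python
import re
import string

def clean_words(text):
    """Normalise quotes, drop 's and punctuation, and return lowercase words."""
    text = text.strip()
    # Map curly quotes and the no-break space to their ASCII equivalents.
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
    # Drop a possessive 's while keeping the word it was attached to.
    text = re.sub(r"(\w)'s\b", r"\1", text)
    # Delete all remaining ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower().split()

sample = "John has a bicycle. Mary says that her one is nicer than John’s."
words = clean_words(sample)
print(words.count("john"))  # 2
```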
Serge Ballesta