
I am working on a word cloud problem. I thought my result met the requirements, since it produces a word cloud without the uninteresting words or punctuation, but apparently it does not. I cannot figure out what I am missing.

The script needs to process the text, remove punctuation, ignore case and words that are not purely alphabetic, count the frequencies, and ignore uninteresting or irrelevant words. The calculate_frequencies function should output a dictionary; the wordcloud module will then generate the image from that dictionary.

My code:

def calculate_frequencies(file_contents):
    # Here is a list of punctuations and uninteresting words you can use to process your text
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    uninteresting_words = ["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just", \
    "in", "for", "so" ,"on", "says", "not", "into", "because", "could", "out", "up", "back", "about"]
    
    # LEARNER CODE START HERE
      
    frequencies = {}
    words = file_contents.split()
    final_words = []
    
    
    
    for item in words:
        item = item.lower()
        
        if item in punctuations:
            words = words.replace(item, "")
                  
        if item not in uninteresting_words and item.isalpha()==True:
            final_words.append(item)
            
    for final in final_words:
        if final not in frequencies:
            frequencies[final] = 0
        else:
            frequencies[final] += 1
                
    #wordcloud
    cloud = wordcloud.WordCloud()
    cloud.generate_from_frequencies(frequencies)
    return cloud.to_array()
Ethan
madmulchr
    `words.replace` while you are iterating over `words` is probably a bad idea – OneCricketeer Sep 08 '21 at 00:15
  • You haven't said why it is deemed insufficient. Split probably only splits on whitespace, and punctuation follows words. Um, and you might want to do something pythonesque with a filter over a stream or something. – Maarten Bodewes Sep 08 '21 at 00:17
  • Please choose a better title of the question - it is not revealing any information about the problem. Read [this](https://stackoverflow.com/help/how-to-ask) short intro on how to ask questions on SO. – normanius Sep 08 '21 at 11:14
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 13 '21 at 08:31

2 Answers

0

So here is a different approach!

punctuations and uninteresting_words are given separately, of course, so that students can follow more easily. Since both can be folded into a single string (by joining the list of words), simply concatenating them saves us work. Doing so reduces the processing to a single loop, and the script gets a lot simpler and cleaner.

def calculate_frequencies(file_contents):
    # Here is a list of punctuations and uninteresting words you can use to process your text
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    uninteresting_words = ["the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
    "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
    "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
    "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
    "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just"]
    
    # LEARNER CODE START HERE
    
    frequencies = {}
    junk = punctuations + " ".join(uninteresting_words)
    lower_txt = file_contents.lower()
    
    for word in lower_txt.split(" "):
        if word not in junk and word.isalpha():
            frequencies[word] = lower_txt.count(word)
            
    
    #wordcloud
    cloud = wordcloud.WordCloud()
    cloud.generate_from_frequencies(frequencies)
    return cloud.to_array()

The question asks for the frequencies of each word in the given text file, and count() comes in very handy for this.
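One caveat worth noting with this approach (an editorial observation, not part of the original answer): `str.count` matches substrings, not whole words, so a short word is overcounted whenever it appears inside a longer one. Counting tokens from `split()` avoids that:

```python
text = "the cat sat on the catalog"

# Substring counting: "catalog" also contains "cat", so this overcounts.
print(text.count("cat"))          # 2

# Counting whole tokens instead gives the intended word frequency.
print(text.split().count("cat"))  # 1
```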

halfer
Krish
-1

As written, I don't think your code runs. words is a list, and .replace is not a valid list method.
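A quick check confirms this: `str.split()` returns a list, and calling `.replace` on it raises immediately.

```python
# `words` comes from str.split(), so it is a list, not a string --
# and list objects have no .replace() method.
words = "hello, world".split()
try:
    words.replace(",", "")
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'replace'
```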


To simply get the counts, see this code

For stripping punctuation, see Best way to strip punctuation from a string

For counting, use a Counter

import string
from collections import Counter

uninteresting_words = {"the", "a", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
"we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
"their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
"have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
"all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just", \
"in", "for", "so" ,"on", "says", "not", "into", "because", "could", "out", "up", "back", "about"}

def calculate_frequencies(s):
    # uninteresting_words is only read here, so no `global` statement is needed
    words = (x.lower().strip().translate(str.maketrans('', '', string.punctuation)) for x in s.strip().split())

    c = Counter(words)
    for x in uninteresting_words:
        if x in c:
            del c[x]
    return c

print(calculate_frequencies('this is a string! A very fancy string?'))
# Counter({'string': 2, 'fancy': 1})

For a WordCloud, you shouldn't need to count anything, as it does that for you. Notice there is a stopwords parameter, and the process_text function uses a regex pattern that ignores punctuation by default - https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html
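For reference, that default tokenization can be approximated in plain stdlib Python like this (a sketch only; WordCloud's actual regex and stopword handling may differ in detail):

```python
import re
from collections import Counter

text = "This is a string! A very fancy string?"

# Roughly what WordCloud's process_text does: pull out word tokens with a
# regex (which skips punctuation), lowercase them, drop stopwords, count.
stopwords = {"this", "is", "a", "very"}
tokens = re.findall(r"\w[\w']*", text.lower())
counts = Counter(t for t in tokens if t not in stopwords)
print(counts)  # Counter({'string': 2, 'fancy': 1})
```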

OneCricketeer