I would like to text-mine an Excel file. First, I must concatenate all rows into one large block of text. Then I scan the text for words that appear in a dictionary. If a word is found, it should be counted under the dictionary key name. Finally, I return the counted words as a relational table [word, count]. I can count the words, but I am unable to get the dictionary part to work. My questions are:

  1. Am I going about this the right way?
  2. Is it even possible, and if so, how?

Here is the code, tweaked from an example found on the internet:


import collections
import re
import matplotlib.pyplot as plt
import pandas as pd
#% matplotlib inline
#file = open('PrideAndPrejudice.txt', 'r')
#file = file.read()

''' Convert excel column/ rows into a string of words'''
#text_all = pd.read_excel('C:\Python_Projects\Rake\data_file.xlsx')
#df=pd.DataFrame(text_all)
#case_words= df['case_text']
#print(case_words)
#case_concat= case_words.str.cat(sep=' ')
#print (case_concat)
text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever.")
''' done'''
import collections
import pandas as pd
import matplotlib.pyplot as plt
#% matplotlib inline
# Read input file, note the encoding is specified here 
# It may be different in your text file

# Startwords
startwords = {'happy':'glad','sad': 'lonely','big': 'tall', 'smart': 'clever'}
#startwords = startwords.union(set(['happy','sad','big','smart']))

# Instantiate a dictionary, and for every word in the file, 
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to strip punctuation and normalize case.
for word in text_all.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word  in startwords:
        if word  in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
#file.close()
# Create a data frame of the most common words 
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')

Error: Empty 'DataFrame': no numeric data to plot

Expected output:

  1. happy 1
  2. sad 1
  3. big 1
  4. smart 1
RobE
  • Hey buddy, thanks for sharing your code, but it would be much better if you showed your raw code and the output after you run it. Generally that's enough for others to help write a custom solution to your problem. Have a read of [ask]. – Umar.H Nov 26 '19 at 15:36
  • I get errors because I don't know how to approach the problem. I don't think I am getting it to read the dictionary the way I want it to. I'll work on the code a bit more and see if I can clarify my question/issue. – RobE Nov 26 '19 at 15:49
  • Sorry Rob, I said raw code when I had intended to say raw data. If you can give a sample of your data, I'll give it my best attempt. – Umar.H Nov 26 '19 at 15:51
  • Do we need matplotlib and pandas to get a word count? Does it work correctly on a small sample that you can include in the question? Also, there are errors in startwords. – Kenny Ostrom Nov 26 '19 at 15:51
  • This is code I found on the internet. It was close to what I needed, so I thought I would tweak it. However, tweaking it does not seem to be working for me, LOL. I need pandas because the real data is in a relational database. That part I got. It's counting words that are in a dictionary that I am having issues with. Matplotlib is there so I can plot the data in a bar chart (Pareto). – RobE Nov 26 '19 at 16:01
  • This may help: https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python?rq=1 – Kenny Ostrom Nov 26 '19 at 16:06
  • You're also duplicating some work already done in the library. See https://docs.python.org/3/library/collections.html#counter-objects – Kenny Ostrom Nov 26 '19 at 16:21
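
The two pointers in the comments above (splitting on several delimiters at once, and letting collections.Counter do the counting) can be combined in a few lines. This is only a minimal sketch on the sample sentence from the question; the set of delimiter characters in the pattern is an assumption about what should count as punctuation.

```python
import collections
import re

text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
            "Jack and Billy were lonely without eachother. "
            "Jack is tall and Billy is clever.")

# Split on any run of whitespace or listed punctuation (assumed delimiter set),
# then drop the empty strings that re.split can leave at the edges.
words = [w for w in re.split(r'[\s.,:!?"*]+', text_all.lower()) if w]

# Counter replaces the manual "if word in wordcount" bookkeeping.
word_counter = collections.Counter(words)
print(word_counter.most_common(2))
```

This counts every word; restricting the count to dictionary words is a separate filtering step.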

2 Answers

Here is a method that should work with the latest version of pandas (0.25.3 at the time of writing; Series.explode requires 0.25.0 or later):

# Setup
import pandas as pd

df = pd.DataFrame({'case_text': ["Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever."]})

startwords = {"happy": ["glad", "estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}

# First you need to rearrange your startwords dict (synonym -> key)
startwords_map = {w: k for k, v in startwords.items() for w in v}

(df['case_text'].str.lower()                 # casts to lower case
 .str.replace(r'[.,*!?:]', '', regex=True)   # removes punctuation and special characters
 .str.split()                                # splits the text on whitespace
 .explode()                                  # expands into a single pandas.Series of words
 .map(startwords_map)                        # maps the words to the startwords keys
 .value_counts()                             # counts word occurrences (NaN is dropped)
 .to_dict())                                 # outputs to dict

[out]

{'happy': 2, 'big': 1, 'smart': 1, 'sad': 1}
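
The rearranging step is what makes the lookup work: the text contains the synonyms (glad, lonely, and so on), so those have to become the keys of the mapping, with the category name as the value. Below is a small stand-alone sketch of that inversion, using the same startwords as above. One caveat worth noting: a synonym listed under two categories would keep only the last one.

```python
startwords = {"happy": ["glad", "estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}

# Flip category -> [synonyms] into synonym -> category.
startwords_map = {w: k for k, v in startwords.items() for w in v}

print(startwords_map["glad"])      # happy
print(startwords_map.get("jack"))  # None: not a dictionary word, so it is ignored
```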
Chris Adams
    if word in startwords:
        if word in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

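This part looks wrong. It checks whether the word is in startwords and then whether it is already in wordcount. If the word is already in wordcount, the count should be incremented; otherwise it should be set to 1. Your branches do the opposite (and the += on a missing key would raise a KeyError), so I believe you have to swap them: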

    if word in wordcount:
        # already in the dict: increment the count
        wordcount[word] += 1
    else:
        # first time seen: set the count to 1
        wordcount[word] = 1
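
To go one step further and count only dictionary words under their key name, the lookup has to run against the dictionary's values, because `word in startwords` tests the keys (happy, sad, ...) and none of those occur in the text. Here is a hedged sketch using the question's one-to-one startwords; the `lookup` name and the use of `str.strip` to trim punctuation are assumptions of this example, not part of the original code.

```python
import string

text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
            "Jack and Billy were lonely without eachother. "
            "Jack is tall and Billy is clever.")

startwords = {'happy': 'glad', 'sad': 'lonely', 'big': 'tall', 'smart': 'clever'}

# Invert so that the word found in the text leads to the key to count.
lookup = {value: key for key, value in startwords.items()}

wordcount = {}
for word in text_all.lower().split():
    word = word.strip(string.punctuation)  # trim leading/trailing punctuation
    if word in lookup:
        key = lookup[word]
        # dict.get supplies 0 the first time a key is seen
        wordcount[key] = wordcount.get(key, 0) + 1

print(wordcount)
```

On the sample sentence, each of the four keys is counted once, which matches the expected output in the question.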
Yunhai
  • Thank you, that did take care of the overall word counting problem. Now to get it to count only words in the dictionary and return the key count. – RobE Nov 26 '19 at 16:21