I am using Python 2.7.4.
I have pieced together a program that reads a .txt file, splits it into words, strips whitespace and punctuation, converts capital letters to lowercase, and returns the x most common words along with a count of how many times each appears in the document. What I am trying to do, and have not been able to, is exclude certain very common words from the output (e.g., "a", "i", "to", "for").
I am a beginner, so I may simply be misunderstanding the answers to existing questions, which I have not been able to apply to my case, such as:
How to remove list of words from a list of strings
and
Remove all occurrences of words in a string from a python list
I have split the steps into separate functions to keep things simple, though I suspect I may in fact be overcomplicating them. My program is below:
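    import string
    from collections import Counter

    def wordlist(line):
        # Split on whitespace, then strip punctuation from each word.
        wordlist2 = []
        wordlist1 = line.split()
        for word in wordlist1:
            cleanword = ""
            for char in word:
                if char in string.punctuation:
                    char = ""
                if char in string.whitespace:
                    char = ""
                cleanword += char
            wordlist2.append(cleanword)
        return wordlist2

    def wordcaps(line):
        # Lowercase every word in the list.
        line = [char.lower() for char in line]
        return line

    def countwords(document):
        # Tally the words and return (word, count) pairs, most common first.
        # Returning instead of printing keeps the caller from printing None.
        words = Counter()
        words.update(document)
        x = words.most_common()
        return x

    def readfile(filename):
        # Read the whole file, closing it once the text is in memory.
        with open(filename) as f:
            fin = f.read()
        print countwords(wordcaps(wordlist(fin)))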
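For reference, the counting step relies on Counter.most_common, which takes an optional argument n to limit the result to the n highest counts. A minimal sketch of returning only the top x words follows. The helper name top_words and the sample list are hypothetical, and the single-argument print(...) form is used so that the sketch runs on both Python 2.7 and Python 3.

```python
from collections import Counter

# Hypothetical sample standing in for the cleaned, lowercased word list.
sample_words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]

def top_words(words, n):
    # Counter tallies each word; most_common(n) returns the n highest
    # counts as (word, count) pairs, highest count first.
    return Counter(words).most_common(n)

top_two = top_words(sample_words, 2)
print(top_two)  # [('the', 3), ('and', 2)]
```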
Here are some of the things I have tried. I created a list, for example filterlist = ['i', 'to', 'and'], and tried to use it as a condition in the wordlist function:
    for word in wordlist1:
        if word in filterlist:
            word = ""
This does not seem to have any effect. I have also tried, to no avail:
    for word in wordlist1:
        if word in filterlist:
            wordlist1.append("")
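A minimal sketch of what seems to be happening, assuming each loop is run on its own exactly as shown. In the first attempt, assigning to word only rebinds the loop variable, so the list is left as it was. In the second, append adds empty strings to the end of the list but never removes the unwanted words. One possible alternative is to build a new list that skips the excluded words. The helper name remove_words and the sample data are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
from collections import Counter

filterlist = ['i', 'to', 'and']

# Hypothetical sample standing in for the cleaned, lowercased word list.
words = ['i', 'want', 'to', 'read', 'and', 'to',
         'count', 'words', 'and', 'words']

# First attempt revisited: assigning to the loop variable rebinds the
# name only, so the list itself is not modified.
attempt = list(words)
for word in attempt:
    if word in filterlist:
        word = ""
unchanged = (attempt == words)

def remove_words(wordlist, excluded):
    # Build a new list that keeps only the words not in the excluded set.
    # A set makes each membership test fast.
    excluded = set(excluded)
    return [word for word in wordlist if word not in excluded]

filtered = remove_words(words, filterlist)
counts = Counter(filtered).most_common(1)

print(unchanged)  # True
print(filtered)   # ['want', 'read', 'count', 'words', 'words']
print(counts)     # [('words', 2)]
```

In the program above, a filter like this would probably belong after wordcaps, so that "I" and "To" are already lowercase when they are compared with filterlist, for example countwords(remove_words(wordcaps(wordlist(fin)), filterlist)).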
I have tried a number of other things, but this question is already getting long. I have seen references to "regex," but I am not sure what that is or how it fits in.
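For context, "regex" is short for regular expression, a pattern language for matching text, available in Python through the standard re module. A rough sketch of how it could replace the character-by-character cleaning is shown here. The helper name tokenize and the pattern [a-z]+ are assumptions chosen for illustration. Note that this pattern splits a word like "don't" into "don" and "t", whereas the wordlist function above would produce "dont".

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text, then collect every run of the letters a-z.
    # Punctuation, digits, and whitespace act as separators and are dropped.
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("The cat, the DOG and the bird!")
top = Counter(tokens).most_common(1)

print(tokens)  # ['the', 'cat', 'the', 'dog', 'and', 'the', 'bird']
print(top)     # [('the', 3)]
```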