Exclude list of words when reading a file

Question

I am using Python 2.7.4

I have pieced together a program that will read a .txt file, separate the words, remove the whitespace and punctuation, convert capital letters to lowercase, and return the x most common words, along with a count of how many times those words appear in the document. What I am trying - and have not been able - to do is to exclude certain most common words from the output (e.g., "a", "i", "to", "for").

I am a beginner, and so I may simply be misunderstanding the responses to certain questions that have already been answered (and that I have not been able to make use of), such as, among others:

How to remove list of words from a list of strings

and

Remove all occurrences of words in a string from a python list

I have tried to separate the different aspects into different functions to hopefully simplify things, though I suspect I may in fact be overcomplicating things. My program is below:

import string

from collections import Counter

def wordlist(line):
    wordlist2 = []
    wordlist1 = line.split()
    for word in wordlist1:
        cleanword = ""
        for char in word:
            if char in string.punctuation:
                char = ""
            if char in string.whitespace:
                char = ""
            cleanword += char
        wordlist2.append(cleanword)
    return wordlist2

def wordcaps(line):
    line = [char.lower() for char in line]
    return line

def countwords(document): 
    words = Counter()
    words.update(document)
    x = words.most_common() 
    print x

def readfile(filename):
    fin = open(filename).read()
    print countwords(wordcaps(wordlist(fin)))

Here are some of the things I have tried. I have tried to create a list - for example, filterlist = ['i', 'to', 'and'] - and to use this as a conditional in the wordlist function:

for word in wordlist1:
    if word in filterlist:
        word = ""

This does not seem to have any effect. I have also tried, to no avail:

for word in wordlist1:
    if word in filterlist:
        wordlist1.append("")

I have tried a bunch of other things, but this question seems to be getting too long in any event. I have seen references to "regex," but am just not sure what that is or how it fits in.

This is some pretty good code for a beginner. :) In my experience most beginners would reinvent the wheel rather than using `collections.Counter`. Good job! :) — kojiro, Oct 20 '13 at 01:50
Thanks, @kojiro. This was my first programming compliment :) — alexponline, Oct 20 '13 at 04:36

user278064 · Accepted Answer · 2013-10-19T21:14:34.403

2

Usually it is enough to do:

words = []
for word in wordlist1:
    if word.lower() not in filterlist:
        words.append(word)

words is the output list containing words which are valid.

Your approach does not work because you are appending to the same list you are iterating over, wordlist1. Appending an empty string adds an element; it does not remove the filtered word:

for word in wordlist1:
    if word in filterlist:
        wordlist1.append("")
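To see why neither attempt from the question removes anything, here is a small sketch. The sample words and the contents of `filterlist` are made up for illustration. Rebinding the loop variable changes only the local name, and appending `""` only adds entries.

```python
filterlist = ['i', 'to', 'and']  # hypothetical filter words

# Attempt 1: rebinding the loop variable does not touch the list.
wordlist1 = ['i', 'went', 'to', 'town']
for word in wordlist1:
    if word in filterlist:
        word = ""  # only the name `word` changes, not the list element
after_rebinding = list(wordlist1)

# Attempt 2: appending "" grows the list; nothing is removed.
for word in wordlist1:
    if word in filterlist:
        wordlist1.append("")
after_appending = list(wordlist1)

# Collecting the words to keep in a separate list does work.
kept = []
for word in ['i', 'went', 'to', 'town']:
    if word not in filterlist:
        kept.append(word)
```

After the first loop the list is unchanged; after the second it has two extra empty strings at the end; only `kept` holds the filtered result.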

You could also do something like this:

wordlist1 = [word for word in wordlist1 if word not in filterlist]

This builds a temporary list of the valid words and then assigns it back to the original name, wordlist1.
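As a rough sketch of how the filtering might slot into the counting step: the helper name `count_filtered` and the sample input are assumptions for illustration, not part of the original program. The filter words are converted to a `set` because membership tests on a set do not scan the whole collection.

```python
from collections import Counter

def count_filtered(words, filterlist):
    """Count words, skipping any found in filterlist (hypothetical helper)."""
    skip = set(filterlist)  # set membership tests are fast
    kept = [word for word in words if word.lower() not in skip]
    return Counter(kept)

counts = count_filtered(['to', 'be', 'or', 'not', 'to', 'be'], ['to', 'or'])
```

Here `counts` ends up holding only `be` and `not`, since both occurrences of `to` and the single `or` are skipped.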

edited Oct 19 '13 at 21:14

answered Oct 19 '13 at 21:08


Thanks. Your point that I was using the same list for input and output made this "click" for me. I have noticed though that while this seems to work almost all the time, certain words such as "me" and "said" do not get filtered out. Any idea why? – alexponline Oct 20 '13 at 04:30
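One possible explanation, offered only as a guess since the thread does not show the updated code: if the filter is applied to the raw tokens from `line.split()` before punctuation is stripped and case is lowered, tokens such as `said,` or `Me"` will not match the entries in `filterlist`. The sketch below uses a made-up sentence and filter list to show the difference.

```python
import string

filterlist = ['me', 'said']  # hypothetical filter words
tokens = 'He said, "Tell Me" and me too'.split()

# Filtering raw tokens: 'said,' and 'Me"' are not equal to 'said' or 'me'.
raw_filtered = [t for t in tokens if t not in filterlist]

# Cleaning first, then filtering. Note that strip() removes only leading
# and trailing punctuation, unlike the character loop in the question.
cleaned = [t.strip(string.punctuation).lower() for t in tokens]
clean_filtered = [t for t in cleaned if t and t not in filterlist]
```

In the first case only the bare `me` is removed; in the second, every form of `me` and `said` is filtered out.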

score 0 · Answer 2 · answered Oct 20 '13 at 01:37

It's probably simplest to read the input one character at a time and check for the ones to include rather than the ones to exclude.

Once a candidate word has been extracted, it can then be converted to lower case and tested against a set of words to be skipped.

Here's a possible implementation:

def parse(text, skip=()):
    text += '\n'
    words = []
    word = ''
    for char in text:
        if char.isalpha():
            word += char
        elif word:
            word = word.lower()
            if word not in skip:
                words.append(word)
            word = ''
    return words

(NB: a newline is appended to the input to make sure the last word gets processed correctly).

Of course, it would be much more efficient to do the parsing with a regular expression:

import re

def parse(text, skip=()):
    words = []
    for word in re.findall(r'\w+', text):
        word = word.lower()
        if word not in skip:
            words.append(word)
    return words
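Note that the two versions are not exactly equivalent: `\w` also matches digits and underscores, whereas `char.isalpha()` accepts only letters. A small illustration with a made-up string follows; the `[A-Za-z]+` pattern is just one ASCII-only alternative, not something this answer prescribes.

```python
import re

text = 'Room_101 has 2 doors'

# \w+ keeps digits and underscores as part of a "word".
word_chars = re.findall(r'\w+', text)

# An ASCII letters-only pattern behaves more like the isalpha() loop.
letters_only = re.findall(r'[A-Za-z]+', text)
```

Which behaviour is preferable depends on whether tokens like `2` or `Room_101` should count as words in the input at hand.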

Here's a simple script that uses the parse function to get word counts from an input file:

import sys
from collections import Counter

SKIP = set('a an and be i is of so the to'.split())

def main(args):
    try:
        with open(args[0]) as stream:
            words = parse(stream.read(), SKIP)
    except IndexError:
        print 'ERROR: no path given'
    except IOError as exception:
        print 'ERROR: could not read file:'
        print '  :', exception
    else:
        counter = Counter(words)
        print counter.most_common()

if __name__ == '__main__':
    main(sys.argv[1:])
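Since the question asks for only the x most common words, it may help to note that `Counter.most_common` accepts an optional count. A minimal sketch with made-up data:

```python
from collections import Counter

counter = Counter(['tree', 'tree', 'tree', 'bird', 'bird', 'sky'])

# Passing n limits the result to the n highest counts, in descending order.
top_two = counter.most_common(2)
```

Calling `most_common()` with no argument, as the script above does, returns every word ordered from most to least frequent.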
