I am relatively new to Python and am trying to write a basic filter for a text file, then count the frequency of the words found in the filtered lines. I've also attempted to apply an NLTK stopword list so common words are left out of the count. So far I have this:
import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0

# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []

# apply stopword list to items in list containing lines matching term
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)

print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)
This runs without errors (though it's probably incredibly messy/inefficient), but the word counts it returns are much too high, roughly ten times higher than I'd expect.
I was just wondering if anyone could spot something obvious I'm missing and point me in the right direction for fixing it. Any help would be really appreciated, thanks!
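In case it helps clarify what I'm aiming for, here is a rough sketch of the behaviour I want (the example lines and expected output are made up, and this isn't code I've actually got working):

from collections import Counter
from nltk.corpus import stopwords

# what I want: count each word of a matching line once,
# skipping anything that appears in the stopword list
stop = set(stopwords.words("english"))
matching_lines = ["the cat sat on the mat", "a cat and a dog"]  # made-up example lines

filtered_words = []
for line in matching_lines:
    for word in line.split():
        if word not in stop:
            filtered_words.append(word)

print Counter(filtered_words)
# hoping for something along the lines of: Counter({'cat': 2, 'sat': 1, 'mat': 1, 'dog': 1})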