
I am relatively new to Python and am trying to write a basic filter for a text file, then count the frequency of words found in the filtered lines. I've attempted to apply a stopword list to it. So far I have this:

import sys, re
from collections import Counter
from nltk.corpus import stopwords

reload(sys)  
sys.setdefaultencoding('utf8')

term = sys.argv[2].lower()
empty = []
count = 0


# filter lines containing term and also add them to empty list
with open(sys.argv[1]) as f:
    for line in f:
        for text in line.lower().split("\n"):
            if term in text:
                empty.append(text)
                count += 1
                print text

# create stopword list from nltk
stop = stopwords.words("english")
stoplist = []


# apply stopword list to items in list containing lines matching term 
for y in empty:
    for t in stop:
        if t not in y:
            stoplist.append(y)

# count words that appear in the empty list
words = re.findall(r"\w+", str(stoplist))
wordcount = Counter(words)

print wordcount
print "\n" + "Number of times " + str(term) + " appears in text is: " + str(count)

This runs fine (though it's probably incredibly messy/inefficient), but the word counts it returns seem much too high, roughly ten times too high.

I was just wondering if anyone could spot something obvious I'm missing and point me in the right direction on how to fix it. Any help would really be appreciated, thanks!

  • How sure are you that the numbers are too high? – Slater Victoroff Nov 25 '15 at 22:25
  • 1
    You're adding one copy of each line of text to `stoplist` for each stopword. That's probably your problem. I have no idea what you're trying to do with the stopwords, so I can't recommend a simple fix. – Peter DeGlopper Nov 25 '15 at 22:27
  • I think you meant `if y not in stop` rather than having a `for t in stop` – R Nar Nov 25 '15 at 22:32
  • 1
    Yes, several ways unfortunately. To start, `empty` is a list of whole lines - splitting on `\n` for each line will give you the whole line. You need to do something to split each line into words before you can use the stopwords list. This answer might get you heading more in the right direction: http://stackoverflow.com/a/19133088/2337736 – Peter DeGlopper Nov 25 '15 at 22:54
  • @PeterDeGlopper I think this is the problem. In my head I was trying to compare all of the words found in the filtered list "empty" against all of the stopwords "stop", and any words that were not in the stopword list I'd append to another list "stoplist" and do the count over that. Maybe I should have done this part differently? – saph_top Nov 25 '15 at 22:55
  • Yes. You probably want to create a set of stopwords based on the nltk data and check whether each word is in that set after splitting each line of text into words. As it is, `if t not in y` checks each stopword. If that stopword does not appear as a substring of `y`, where `y` is a whole line, it appends one copy of `y` - so a line that contains no stopwords as substrings will be appended `len(stop)` times. – Peter DeGlopper Nov 25 '15 at 23:01
  • @PeterDeGlopper Got it, all sorted. I added the step to split the contents of the empty list and then ran it against the stopword list. Thanks so much! – saph_top Nov 25 '15 at 23:08
  • As an extra tip, with `for text in line.lower().split("\n"):` you shouldn't need the `.split("\n")` part, since you're already iterating over lines (it doesn't really make sense to split a line on newlines!) – Tom Dalton Nov 25 '15 at 23:58
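Following the fix outlined in the comments, a minimal sketch of a corrected counting step might look like the following. It assumes the `empty` list of matching lines from the question has already been built; it turns the NLTK stopword list into a set, splits each line into words, and counts only the words that are not stopwords:

```python
import re
from collections import Counter
from nltk.corpus import stopwords

# Use a set for fast membership tests against the NLTK English stopword list
stop = set(stopwords.words("english"))

wordcount = Counter()
for line in empty:                         # `empty` holds the lowercased lines that matched `term`
    for word in re.findall(r"\w+", line):  # split the line into words before checking stopwords
        if word not in stop:               # drop stopwords, count everything else
            wordcount[word] += 1

print(wordcount)
```

Because each line is split into words before the stopword check, every kept word is counted exactly once, rather than each line being appended once per stopword as in the original nested loop.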

0 Answers