0

In Python I am trying to create a list (myClassifier) that appends a classification ('bad'/'good') for each text file (txtEntry) stored in a list (txtList), based on whether or not it contains a bad word stored in a list of bad words (badWord).

txtList = ['mywords.txt', 'apple.txt', 'banana.txt', ... , 'something.txt']
badWord = ['pie', 'vegetable', 'fatigue', ... , 'something']

txtEntry is merely a placeholder; really I just want to iterate through every entry in txtList.

I've produced the following code in response:

for txtEntry in txtList:
    if badWord in txtEntry:
        myClassifier += 'bad'
    else:
        myClassifier += 'good'

However, I'm receiving TypeError: 'in <string>' requires string as left operand, not list as a result.

I'm guessing that badWord needs to be a string as opposed to a list, though I'm not sure how I can get this to work otherwise.

How could I otherwise accomplish this?

Alpine
  • Can you please post your input data sample? – Manoj Awasthi Mar 26 '14 at 07:07
  • Ok what is the type of badWord and txtEntry, from the error I am assuming badWord is list and txtEntry is string ? – James Sapam Mar 26 '14 at 07:19
  • so if txtEntry is a string and badWord is a list then you probably need to swap the operands of the if statement, i.e.: `if txtEntry in badWord:` – suhailvs Mar 26 '14 at 07:34
  • I added in some examples; badWord is a list of bad words, and txtEntry is merely a placeholder used to iterate through each entry in the txtList list. – Alpine Mar 26 '14 at 08:17
  • to clarify: do you want to find bad words in a *file name* or in its content (`open('file name').read()`)? – jfs Mar 26 '14 at 08:39
  • @J.F. Sebastian I'd like to find bad words in a file's content. – Alpine Mar 26 '14 at 08:47

4 Answers

2

This

if badWord in txtEntry:

asks whether badWord is a substring of txtEntry. Since badWord is a list rather than a string, that test cannot work (hence the TypeError); what you need to do instead is check each string in badWord separately. The easiest way to do this is with the built-in function any. You do need to normalise txtEntry, though, because (as mentioned in the comments) you care about matching exact words, not just substrings (which is what string in string tests for), and you (probably) want the search to be case insensitive:

import re

myClassifier = []  # start with an empty list of classifications

for txtEntry in txtList:
    # Split into lower-cased words so that `word in contents` doesn't give
    # false positives for substrings - avoid e.g. 'ass' in 'class'
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]

    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

Note that, like other answers, I've used the list.append method instead of += to add the string to the list. If you use +=, your list would end up looking like this: ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].
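
To make the difference concrete, here is a small sketch of the three ways of adding to a list. The list names (extended, appended, wrapped) are made up for illustration and are not from the question:

```python
# Illustrative names only; none of these lists appear in the question.
extended = []
extended += 'good'        # a string is iterable, so each character is added
extended += 'bad'

appended = []
appended.append('good')   # the whole string is added as a single item
appended.append('bad')

wrapped = []
wrapped += ['good']       # += with a one-item list also adds a single item
wrapped += ['bad']

print(extended)  # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
print(appended)  # ['good', 'bad']
print(wrapped)   # ['good', 'bad']
```

Either append or += with a one-item list gives one classification per entry.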

Per the comments on the question, if you want this to check the file's content when you're only storing its name, you need to adjust this slightly - you need a call to open, and you need to then test against the contents - but the test and the normalisation stay the same:

import re

myClassifier = []  # start with an empty list of classifications

for txtEntry in txtList:
    with open(txtEntry) as f:
        # Split into lower-cased words so that `word in contents` doesn't give
        # false positives for substrings - avoid e.g. 'ass' in 'class'
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

These loops both assume that, as in your sample data, all of the strings in badWord are in lower case.
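
If you can't rely on that, one option is to lower-case the bad words once before the loop. This is only a sketch: the sample data is made up (mixed case on purpose), and classify and bad_set are hypothetical names, not something from the question:

```python
import re

# Hypothetical sample data; the mixed case is deliberate.
badWord = ['Pie', 'VEGETABLE', 'fatigue']

# Lower-case the bad words once, up front; a set makes each lookup cheap.
bad_set = {word.lower() for word in badWord}

def classify(text):
    """Return 'bad' if any whole word in text is a bad word, else 'good'."""
    words = {w.lower() for w in re.split(r'\W+', text) if w}
    return 'bad' if words & bad_set else 'good'

print(classify('I baked a PIE, yesterday.'))  # bad
print(classify('Magpies are birds.'))         # good
```

The set intersection does the same job as the any(...) test above; it just compares both sides in lower case.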

lvc
  • `word in contents` matches all substrings e.g., it finds `ass` in `class` i.e., it mistakenly classifies `class` as a bad word – jfs Mar 26 '14 at 09:08
  • `.split()` won't catch `ass,` (note: comma) – jfs Mar 26 '14 at 09:14
  • I'm getting an IOError at `with open(txtEntry) as f:`; `IOError: [Errno 2] No such file or directory: 'Text from one of the text files here.'` – Alpine Mar 26 '14 at 10:06
  • @J.F.Sebastian ... right. Edge cases are fun. Updated again. – lvc Mar 26 '14 at 10:06
  • @Alpine the second version of the code assumes each `txtEntry` is a filename (while the first assumes each one is a word). If you've put the file *contents* into your list directly, but not split it into words, you can remove the `with ...:` line completely, and replace `f.read()` with `txtEntry` (and unindent that line). – lvc Mar 26 '14 at 10:11
  • The search should be case-insensitive, to catch `Ass` – jfs Mar 26 '14 at 10:13
  • @Alpine the first part in my answer now covers the case that each `txtEntry` contains raw contents from the file instead of the filename. – lvc Mar 26 '14 at 10:27
  • @lvc Ah gotcha, just fixed up the txtList. I tried using the first part of your answer, however all that the list is receiving is 'bad' (for the next odd 100 entries). To your knowledge (text file content aside), is it working as anticipated? – Alpine Mar 26 '14 at 10:51
  • @Alpine I just noticed a typo in the first part of the answer - when I changed it to split the txtEntry into words, I didn't update the test to use `contents` instead of `txtEntry`. It should hopefully work better now. – lvc Mar 26 '14 at 11:25
  • @lvc Very strangely, the output remains the same (just storing 'bad'). Can you see if there's anything else that requires changing, or could it just be that all my text files contain bad words? – Alpine Mar 26 '14 at 12:54
  • @Alpine I've changed the normalisation logic completely - it now uses `re.split` instead of `str.split`, which is a) simpler, and b) actually works. Using `str.split(sep)` splits at occurrences of *the full separator* only, rather than any character in it. For that IO error, if you give `open` just a filename, it will look for that file in the folder you're running your script from; if the files are in a different place, you'll need to adjust that path accordingly. – lvc Mar 26 '14 at 23:30
  • @lvc I just modified `open` to include the file directory, `with open(txtDir + '/' +txtEntry) as f:` so now the IO error has disappeared, however the output remains the same (just 'bad'). – Alpine Mar 27 '14 at 00:55
  • @Alpine I tested it on a minimal example (two files, one bad word which is only in one file) and got the expected output. Have you checked to see if just 'bad' *is* the right output for your current data? – lvc Mar 27 '14 at 01:09
  • @lvc I just tested it by including as the first list to be passed through, 0.txt, containing just one word that absolutely would not be included in the badWord list (it was in lowercase); however 0.txt was still marked as 'bad' (when it really should have been 'good'). – Alpine Mar 27 '14 at 05:39
  • @Alpine try putting this in the same `if`-branch as `append('good')`: `print(next(word for word in badWord if word in contents))`. That will tell you *a* badWord that it thinks is in each file. I assume from your comment that you're reading the badWord list in from somewhere? If so, you may also want to `print(badWord)` and/or `print('presumed-safe-word' in badWord)` before the loop. – lvc Mar 27 '14 at 07:03
  • @lvc Seeing how all of the entries of `myClassifier` were 'bad', it may be better to check what the supposed word in `badWord` was above `myClassifier.append('spam')`. What line would be suitable to print out the presumed bad word found in a file? Just to clarify: I declared `badWord = ['orange', 'box']` above the loop, and then attempted to have it read through every file/entry in `txtList`, particularly the first file listed which I recently created to hold just one word not listed in `badWord`, though it was marked as 'bad'. – Alpine Mar 27 '14 at 12:40
  • @Alpine in my last comment, I did mean to say put that line in the 'bad' branch, not the 'good' one. The line `print(next(...))` from that comment will print the supposed bad word. – lvc Mar 27 '14 at 14:05
  • @lvc My bad; I tried as you suggested by putting `print(next(word for word in badWord if word in contents))` in the 'bad' branch, though I received `StopIteration` on that print line. Which loop would you suggest `print(badWord)` be placed (placing it above the `if` loop just ended up printed all entries for each and every entry in txtFiles)? – Alpine Mar 28 '14 at 08:52
  • @Alpine ... well that's kind of interesting. If that line raises `StopIteration`, then it means the generator `(word for word in badWord if word in contents)` is empty. But that means that `any(word in contents for word in badWord)` would be False. There is *no way* that that can be True but the other generator be empty. Are you sure that you have copied the `if any(..):` line correctly? If you set `badWord = ['some', 'strings']` as you said in your previous comment, there should be no need to `print` it - that suggestion was in case it was being read from some external source. – lvc Mar 28 '14 at 09:46
  • @lvc I have a snippet here: [link](http://www.snipsave.com/user/profile/alpine#7512). As far as I'm aware the `if (any...)` line was copied correctly. – Alpine Mar 29 '14 at 04:14
2

To find which files have bad words in them, you could:

import re
from pprint import pprint

filenames = ['mywords.txt', 'apple.txt', 'banana.txt', 'something.txt']
bad_words = ['pie', 'vegetable', 'fatigue', 'something']

classified_files = {}  # filename -> good/bad
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search
for filename in filenames:
    with open(filename) as file:
        for line in file:
            if has_bad_words(line):
                classified_files[filename] = 'bad'
                break  # go to the next file
        else:  # no bad words
            classified_files[filename] = 'good'

pprint(classified_files)
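
To see what the pattern does without touching any files, here is a small sketch that runs the same kind of regex over a few in-memory strings. The sample strings and the names pattern, samples and results are made up for illustration:

```python
import re

bad_words = ['pie', 'fatigue']  # sample values taken from the question

# \b anchors the match at word boundaries; re.escape protects any regex
# metacharacters in the words; re.I makes the search case-insensitive.
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)), re.I)

# Made-up sample lines, standing in for lines read from a file.
samples = ['Apple PIE, please', 'magpies and occupied', 'no issues here']
results = {text: bool(pattern.search(text)) for text in samples}

for text, found in results.items():
    print('%-22s -> %s' % (text, 'bad' if found else 'good'))
```

Only the first sample is flagged: PIE matches regardless of case and the comma does not get in the way, while the pie inside magpies and occupied is not a whole word.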

If you also want to mark as 'bad' the different inflected forms of a word, e.g., if cactus is in bad_words and you also want to reject cacti (a plural), then you might need stemmers or, more generally, lemmatizers, e.g.,

from nltk.stem.porter import PorterStemmer # $ pip install nltk

stemmer = PorterStemmer()
print(stemmer.stem("pies")) 
# -> pie

Or

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cacti'))
# -> cactus

Note: you might need import nltk; nltk.download() to download wordnet data.

It might be simpler just to add all possible forms, such as pies and cacti, to the bad_words list directly.
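
A rough sketch of that simpler route, assuming you are willing to write the inflected forms by hand. The extra_forms table and the expanded name are hypothetical:

```python
bad_words = ['pie', 'cactus']

# Hand-written inflected forms; this table is an assumption for illustration.
extra_forms = {
    'pie': ['pies'],
    'cactus': ['cacti', 'cactuses'],
}

# Start from the base words and add every listed form of each one.
expanded = set(bad_words)
for word in bad_words:
    expanded.update(extra_forms.get(word, []))

print(sorted(expanded))  # ['cacti', 'cactus', 'cactuses', 'pie', 'pies']
```

The expanded collection can then be used in place of bad_words when building the regex.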

jfs
  • The first part worked well for when badWords consisted of say, two entries. When I tried using it with more entries however, I wasn't able to finish compiling it. – Alpine Mar 29 '14 at 04:37
  • @Alpine: what traceback does it show if you try to interrupt it with Ctrl + C? Don't put it in the comments, [update your question](http://stackoverflow.com/posts/22653671/edit) or [ask a new one](http://stackoverflow.com/questions/ask). – jfs Mar 29 '14 at 08:41
  • Where should I be placing that line? – Alpine Mar 29 '14 at 14:57
  • @Alpine: `Ctrl + C` is not a line, it is a keyboard shortcut that you could use to interrupt a script running at the command-line. – jfs Mar 30 '14 at 12:47
  • I tried using Ctrl + C, although I received a Syntax Error using it. – Alpine Mar 30 '14 at 20:41
  • @Alpine: Holding `Ctrl` and pressing `C` in the command-line should interrupt the command that you are running (it most certainly does not lead to SyntaxError). You need a primer on the command line: what it is, and how to do everyday things in it such as how to run Python script if you are a Python developer. Try [Command Line Crash Course](http://learncodethehardway.org/cli/book/cli-crash-course.html) or ask on [SuperUser what `Ctrl+C` is and when you might need it (on the command line)](http://superuser.com/questions/ask). – jfs Mar 30 '14 at 20:53
  • Before I attempt this, would it be possible to achieve a similar response via an integrated development environment (i.e. Aptana Studio)? Just asking because I've been using one to run this code on. – Alpine Mar 31 '14 at 05:23
  • @Alpine: It might be, you need to read Aptana Studio docs specifically. I haven't realized that Ctrl+C might be a problem. The point of using it, is to find out *where your code is stuck* (It is a clever technique: if you [break the execution several times at random moments then the chances are that the interruptions occur in the parts of your code that take most of the time](http://stackoverflow.com/a/378024/4279)). As an alternative, use a profiler: `python -mcProfile your_script` or a debugger `python -mpdb your_script` to see where the code is stuck or just insert print statements – jfs Mar 31 '14 at 11:55
0

You should be looping over badWord items too, and for each item you should check if it exists in txtEntry.

for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() adds the whole string as one item; += 'bad' would add
        # each letter separately (myClassifier += ['bad'] would also work)
        myClassifier.append("bad")
    else:
        myClassifier.append("good")

Thanks to @lvc's comment.

bingorabbit
  • This doesn't meet the OP's spec. Need to append "bad" or "good" *once* for every text entry - so, `len(myClassifier) == len(txtList)`. This code will give `len(myClassifier) == len(txtList)*len(badWord)`. – lvc Mar 26 '14 at 08:22
  • Still won't work. It's now equivalent to `if badWord[0] in txtEntry` (except it's a no-op rather than an error when `badWord` is empty). If the second or third badWord is in txtEntry but the first isn't, this will append "good". – lvc Mar 26 '14 at 08:37
  • @lvc Yup, that's another catch, used any() . – bingorabbit Mar 26 '14 at 08:59
-2

try this code:

    myClassifier.append('bad') 
Manoj Awasthi