2

Possible Duplicate:
Remove items from a list while iterating in Python

I have a fairly embedded list: specifically, it is a list of lists of tuples. To simplify things, the entire list is a list of sentences. Within each sentence, each word is made into a tuple, containing information about that word. The last tuple in each sentence contains information about the speaker, but can be removed, if need be.

I'd like to search through these tuples, and, if a certain value is found, then remove the entire sentence.

Here is a sample list:

sentenceList = [[('the', 'det', '1|2|DET'), ('duck', 'n', '2|3|SUBJ'), ('xxx', 'unk', '3|0|ROOT'), ('*MOT', 373)],
                [('yyy', 'unk', '1|0|ROOT'), ('*CHI', 375)], 
                [('what', 'pro', '1|2|OBJ'), ('happen-PAST', 'v', '2|0|ROOT'), ('to', 'prep', '3|2|JCT'), ('the', 'det', '4|5|DET'), ('duck', 'n', '5|3|POBJ'), ('*MOT', 378)], 
                [('boom', 'int', '1|0|ROOT'), ('*CHI', 379)]]

If a sentence contains either 'xxx' or 'yyy', I'd like to remove the entire sentence. The code I tried was:

wordList = ['xxx','yyy']
for sentence in sentenceList:
    for wordTuple in sentence:
        for entry in wordTuple:
            if entry in wordList:
                del sentence

This should delete the entire sentence, i.e:

[('the', 'det', '1|2|DET'), ('duck', 'n', '2|3|SUBJ'), ('xxx', 'unk', '3|0|ROOT'), ('*MOT', 373)], [('yyy', 'unk', '1|0|ROOT'), ('*CHI', 375)]

However, this code doesn't seem to be accomplishing the task. Any idea how to fix it? Thanks!

Community
  • 1
  • 1
Adam_G
  • 7,337
  • 20
  • 86
  • 148
  • 1
    I'm guessing the problem is that you're deleting a member of a list (`sentence`) from that list (`sentenceList`) while you're iterating over the list. [This answer](http://stackoverflow.com/questions/1207406/remove-items-from-a-list-while-iterating-in-python) should help you with ways to deal with that problem. – Sam Mussmann Aug 01 '12 at 19:35
  • Thanks @SamMussmann! I came across that, but wasn't sure how to adapt it to my specific scenario. – Adam_G Aug 01 '12 at 19:42
  • I've added an answer with an adaptation of that answer to your problem. I hope that's helpful. :-) – Sam Mussmann Aug 01 '12 at 21:56

3 Answers3

2
wordList = set(('xxx','yyy'))
for sentence in sentenceList[:]:
    removed = False
    for wordTuple in sentence:
        for entry in wordTuple:
            if entry in wordList:
                sentenceList.remove(sentence)
                removed = True
                break
            # end of if
        # end for each entry
        if removed:
            break
    # end for each word tuple
# end for each sentence

Notes:

  • iterate over a (shallow) copy of the list to avoid the errors that arise from modifying the collection you're traversing
  • remove the object from the list, instead of simply removing the variable name from the local namespace
  • this isn't efficient for large datasets
dsh
  • 12,037
  • 3
  • 33
  • 51
  • a set would be a better datastructure for `wordList`. – mgilson Aug 01 '12 at 19:39
  • Thanks! This certainly does the trick. A couple of questions: (1) What does adding "[:]" to SentenceList do? (2) Any quick suggestions for how to make the code more efficient? I will be applying it to large datasets. (3) For @mgilson, how do I declare it as a set? – Adam_G Aug 01 '12 at 19:42
  • Adding `[:]` makes a shallow copy of the list. You can make a set like `wordList = set(['xxx', 'yyy'])`. – Sam Mussmann Aug 01 '12 at 19:44
  • 2
    `[:]` is [slice syntax](http://stackoverflow.com/q/509211/344286). Take a look at my answer for a more efficient method. And just `set(mylist)` will return a copy of your list turned into a set. I think it's a shallow copy though (so if you have a list inside of the original list, modifying it in one place will change it in the other too. That's kind of a lie, but it's the quickest way to explain it) – Wayne Werner Aug 01 '12 at 19:47
  • @Adam_G -- This is a micro-optimization, but `set(('xxx','yyy'))` will be faster than `set(['xxx','yyy'])`. If you only need a sequence, use a tuple instead of a list. – mgilson Aug 01 '12 at 19:52
  • Thanks @mgilson. What do you mean by "only need a sequence"? – Adam_G Aug 01 '12 at 19:54
  • @Adam_G -- I don't know if I can give a concrete definition. basically, if you think to yourself, I would like a list here, but then realize that you don't actually use any of the properties of a list (like appending/extending/removing items), then you probably want a tuple instead. – mgilson Aug 01 '12 at 19:56
  • @dsh - I actually had to uncheck this answer, as I found it didn't work on my larger list. When I run it, I get the error "ValueError: list.remove(x): x not in list." I think the problem is that the code removes the index, but doesn't move onto the next index, since the shallow copy is still intact. Am I doing something wrong? – Adam_G Aug 01 '12 at 20:07
  • @Adam_G Ooops - it looks like both of us forgot to `break` out of the inner loop once we removed the sentence from the list. Wayne's answer avoids the list copy of the slice and the linear search of the `.remove()` call. I also changed `wordList` to a set, which will improve performance some (more for larger sets than smaller sets). – dsh Aug 01 '12 at 20:37
  • Thanks @dsh, but I still get the same error. I tried breaking multiple times, but without success. – Adam_G Aug 01 '12 at 20:49
  • Try this version. As I pay more attention to this error I realize there are two inner loops, not just one. That complicates it. The `ValueError` on `remove()` indicates that either the object is no longer in the list at that time, or you have an issue with equality comparisons with your class. I don't think it is the latter, since the code in your question uses only built-in types. – dsh Aug 01 '12 at 21:14
1

It's dangerous to try modifying a list while you're iterating over it with for. What you really want is a while loop:

contrived_data = [[(1, 1, 1), ('hello', 'bar')], [(222, 3, 4), ('norweigan', 'blue')], [('anthrax', 'ripple'), (42, 'life')]]

looking_for = (1, 'life')

index = 0
while index < len(contrived_data):
    for two_pull in contrived_data[index]:
        for item in looking_for:
            if item in two_pull:
                print(contrived_data.pop(index))
                index -= 1
                break # Only jumps out of the innermost loop
    index += 1

And that should more efficient for larger datasets than copying your original list.

Wayne Werner
  • 49,299
  • 29
  • 200
  • 290
  • How would this work for a list (or set) of words, though? Since this only checks for "1", how could I alter it to check for a bunch of values? – Adam_G Aug 01 '12 at 20:04
  • This also seems to throw "IndexError: pop index out of range." Am I doing something wrong? – Adam_G Aug 01 '12 at 20:22
  • I added another loop there, along with the appropriate `break`. Does the update still throw the exception? Make sure your indentation level is correct on everything. And, if it still breaks, post your updated code and I'll see what I can do. – Wayne Werner Aug 01 '12 at 21:24
  • Wouldn't you want to do the `for` over `two_pull` and then check if `item` is in `looking_for`? Your way involves iterating over `two_pull` for each item in `looking_for`. And if you make `looking_for` a `set`, then `item in looking_for` won't require Python to iterate through `looking_for`. – Sam Mussmann Aug 01 '12 at 23:04
1

This answer is similar. To apply it, we need a predicate (function of one argument that returns only True or False) that determines whether the entry should stay or not.

Given that we have the target words in a set called wordList:

wordList = set(('xxx', 'yyy'))

This predicate should work:

def keep_sentence(sentence):
    for wordTuple in sentence:
        for entry in wordTuple:
            if entry in wordList:
                return False
    return True  # Only executed if we didn't return false earlier

Now that we have a predicate, we can replace the contents of sentenceList with only the sentences that keep_sentence tells us we should keep:

sentenceList[:] = [x for x in sentenceList if keep_sentence(x)]

As far as applying this to large datasets -- there's probably not going to be a much faster algorithm than this (or one of the other answers) without parallelizing your code. Why? In order to check that each sentence doesn't contain one of the target words, we have to look at each word in each sentence. You may be able to reduce the amount of time you spend on each sentence by some constant factor, but that's not going to help a huge amount.

If you're interested in this, you might want to check out the multiprocessing module, especially process pools.

Community
  • 1
  • 1
Sam Mussmann
  • 5,883
  • 2
  • 29
  • 43