3

I'm using Python to parse some strings in a list. Some of the strings may only contain non-alphanumeric characters which I'd like to ignore, like this:

import re

list = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']

for item in list:
    if re.search(r'\W+', item):
        list.remove(item)

# Ideal output
list = ['desk', 'apple', ':desk', '(house']

# Actual output
list = ['desk', 'apple', '(house', '(:', '(', '))']

That's my first attempt at the regex for this problem, but it's not having the desired effect. How would I write a regex to ignore strings that consist only of non-alphanumeric characters?

  • 2
    What result do you expect? This seems to be correct, as the two strings with non-alphanumeric characters have been removed. –  Dec 10 '13 at 16:42
  • Hmm, I may have misread, but I took your question to be that you only want to exclude strings which are only non-alphanumeric, i.e. you want to leave a string like '(apple)' in the list. Is that correct? – Sean Dec 10 '13 at 16:50
  • I've updated my example to show what I'm getting, versus what I'd like. –  Dec 10 '13 at 16:54

3 Answers

6

By the way, your regex does seem to match non-alphanumeric characters. However, it isn't advisable to remove items from a list you're currently iterating over, and that is the cause of this error. To overcome this, create a new list and append to it the elements that don't match.


Demo:

import re

list = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']
new_list = []

for item in list:
    # Keep every string except those made up entirely of non-word characters.
    if not re.search(r'^\W+$', item):
        new_list.append(item)

print(new_list)

Produces:

['desk', 'apple', ':desk', '(house']

As far as I tested this works in nearly all scenarios.
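
To see why removing in place goes wrong, here is a minimal sketch that records which strings the loop actually gets to test. It reuses the pattern from the question; the names `items` and `visited` are only illustrative.

```python
import re

items = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']

visited = []
for item in items:
    visited.append(item)
    if re.search(r'\W+', item):
        # Removing shifts every later element one position to the left,
        # so the loop's next step jumps over the element that moved up.
        items.remove(item)

print(visited)  # ['()', 'apple', ':desk', ')', ')(', ':(']
```

Five of the eleven strings are never tested, which is why some punctuation-only strings survive no matter what the regex is.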

K DawG
2

What about a list comprehension with re.match(pattern, string):

import re

items = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']
cleaned_items = [item for item in items if re.match(r'\W?\w+', item)]
print(cleaned_items)

This prints

['desk', 'apple', ':desk', '(house']
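
One caveat: with re.match, the pattern \W?\w+ allows at most one leading non-word character, so a string such as '((house' would be dropped. A looser filter might look like the sketch below, assuming the goal is to keep any string that contains at least one word character. The helper name `has_word_char` and the extra sample string '((house' are made up for illustration, and note that \w also matches an underscore.

```python
import re

def has_word_char(text):
    """Return True if text contains at least one word character."""
    return re.search(r'\w', text) is not None

items = ['()', 'desk', 'apple', ':desk', '(house', '((house', ')', '(:', ')(', '(', ':(', '))']
cleaned_items = [item for item in items if has_word_char(item)]

print(cleaned_items)  # ['desk', 'apple', ':desk', '(house', '((house']
```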
Jon
0

The problem is not with your regex. You are iterating over a list while modifying it, which makes the loop skip elements (see Modifying list while iterating). You can use a list comprehension like Jon posted, or you can iterate over a copy of the list: for item in list[:]:
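
A minimal sketch of the copy-based loop, assuming the anchored pattern ^\W+$ (strings made up entirely of non-word characters) is what the edited question calls for. The variable name `items` replaces `list` here only to avoid shadowing the built-in.

```python
import re

items = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']

# Iterate over a shallow copy so that removals from `items` do not
# shift the positions the loop is walking over.
for item in items[:]:
    # ^\W+$ matches only strings with no word characters at all.
    if re.search(r'^\W+$', item):
        items.remove(item)

print(items)  # ['desk', 'apple', ':desk', '(house']
```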

Sean
  • The problem is also with the regex. Based on the OP's edited question, they want a regex which will match strings that consist _only_ of non-alphanumeric characters. – Tim Pierce Dec 10 '13 at 17:18