Regex matching non-alphanumeric characters

Question

I'm using Python to parse some strings in a list. Some of the strings may only contain non-alphanumeric characters which I'd like to ignore, like this:

list = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']

for item in list:
    if re.search(r'\W+', item):
        list.remove(item)

# Ideal output
list = ['desk', 'apple', ':desk', '(house']

# Actual output
list = ['desk', 'apple', '(:', '(', '))']

That's my first attempt at the regex for this problem, but it's not really having the desired effect. How would I write a regex to ignore any strings with non-alphanumeric characters?

What result do you expect? This seems to be correct, as the two strings with non-alphanumeric characters have been removed. — , Dec 10 '13 at 16:42
Hmm, I may have misread, but I took your question to be that you only want to exclude strings which are only non-alphanumeric, i.e. you want to leave a string like '(apple)' in the list. Is that correct? — Sean, Dec 10 '13 at 16:50
I've updated my example to show what I'm getting, versus what I'd like. — , Dec 10 '13 at 16:54

K DawG · Answer 1 · 2013-12-10T17:16:48.353

6

Your regex does match non-alphanumeric characters. However, it isn't advisable to remove items from a list you're currently iterating over, and that is the cause of this error. To work around it, create a new list and append the elements that don't match. Anchoring the pattern as ^\W+$ also restricts it to strings made up only of non-alphanumeric characters.


Demo:

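import re

# Named "items" rather than "list" to avoid shadowing the built-in type.
items = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']
new_list = []

for item in items:
    # Keep the item unless it consists solely of non-word characters.
    if not re.search(r'^\W+$', item):
        new_list.append(item)

print(new_list)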

Produces:

['desk', 'apple', ':desk', '(house']

As far as I tested this works in nearly all scenarios.
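
To make "nearly all scenarios" a bit more concrete, here is a small sketch that runs the same anchored pattern against a few edge cases. The helper name is_only_non_word and the sample strings are illustrative assumptions, not part of the original answer. Two things worth noticing: \w includes the underscore, so '__' is kept, and an empty string is also kept because \W+ needs at least one character.

```python
import re

# Same anchored pattern as in the demo above, compiled once.
ONLY_NON_WORD = re.compile(r'^\W+$')

def is_only_non_word(text):
    """Hypothetical helper: True when text is made up solely of \\W characters."""
    return ONLY_NON_WORD.search(text) is not None

# Illustrative samples, including an underscore-only, an empty and a blank string.
samples = ['()', ':desk', '(house', '__', '', ' ', 'a-b']
kept = [s for s in samples if not is_only_non_word(s)]
print(kept)  # [':desk', '(house', '__', '', 'a-b']
```

If empty strings or underscores should be dropped as well, the pattern would need adjusting, for example to a negated character class; that depends on what counts as "alphanumeric" for your data.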

edited Dec 10 '13 at 17:16

answered Dec 10 '13 at 16:42

K DawG


\W == [^\w] by definition – njzk2 Dec 10 '13 at 16:49
That's what I said, @njzk2 – K DawG Dec 10 '13 at 16:49

Jon · Answer 2 · 2013-12-10T17:13:36.697

2

What about a list comprehension with re.match(pattern, string):

import re

items = ['()', 'desk', 'apple', ':desk', '(house', ')', '(:', ')(', '(', ':(', '))']
# Raw string so the backslashes reach the regex engine unchanged.
cleaned_items = [item for item in items if re.match(r'\W?\w+', item)]
print(cleaned_items)

This prints

['desk', 'apple', ':desk', '(house']
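
One caveat with this pattern, sketched below under the assumption that strings with several leading symbols may appear in the data (the sample strings are made up for illustration): re.match anchors at the start of the string, and \W? allows at most one non-word character before the first word character. A string such as '((house' is therefore dropped, whereas a plain re.search for \w keeps anything containing at least one word character.

```python
import re

# Illustrative samples; two of them have more than one leading symbol.
samples = [':desk', '((house', '--flag', '()']

# re.match anchors at the start: at most one non-word character may come first.
one_leading = [s for s in samples if re.match(r'\W?\w+', s)]

# re.search scans the whole string for any word character.
any_word_char = [s for s in samples if re.search(r'\w', s)]

print(one_leading)    # [':desk']
print(any_word_char)  # [':desk', '((house', '--flag']
```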

edited Dec 10 '13 at 17:13

answered Dec 10 '13 at 17:01

Jon


score 0 · Answer 3 · edited May 23 '17 at 10:25

0

The problem is not with your regex. You are iterating over a list while modifying it: each removal shifts the remaining items one position to the left, so the loop skips the element that follows the removed one (see Modifying list while iterating). You can use a list comprehension like Jon posted, or you can iterate over a copy of the list: for item in list[:]:
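
The skipping can be reproduced with a minimal sketch. The function names and the sample data are hypothetical, and the anchored pattern ^\W+$ is assumed so that only symbol-only strings are targeted.

```python
import re

def remove_in_place(items):
    """Buggy on purpose: removes from the very list being iterated."""
    for item in items:
        if re.search(r'^\W+$', item):
            items.remove(item)
    return items

def remove_via_copy(items):
    """Iterates over a shallow copy, so removals cannot shift the loop position."""
    for item in items[:]:
        if re.search(r'^\W+$', item):
            items.remove(item)
    return items

data = ['(', ')', 'desk', '(:', ')(', 'apple']
print(remove_in_place(list(data)))  # [')', 'desk', ')(', 'apple']
print(remove_via_copy(list(data)))  # ['desk', 'apple']
```

In the first call, ')' and ')(' survive because each sits directly after a removed item and is never visited.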

edited May 23 '17 at 10:25

Community


answered Dec 10 '13 at 16:48

Sean


The problem is also with the regex. Based on the OP's edited question, they want a regex which will match strings that consist _only_ of non-alphanumeric characters. – Tim Pierce Dec 10 '13 at 17:18
