Removing items in a list

Question

a = ['pear', 'apple?orange']

or

a = ['pear', 'apple!orange']

The question mark and the exclamation mark can be any non-alphabetic, non-numeric character (such as <, ?, or #).

If I want to split on the non-alphabetic character and make the following list:

b = ['apple', 'orange']

How do I do it? Do I use `a.remove` or `a.split`?

I'm using Python 3.

What happened to `'pear'` here? Did you mean to process all the values in `a`? – Martijn Pieters, Dec 02 '13 at 23:05

Martijn Pieters · Answer 1 · 2013-12-02T23:05:44.657

Use re.split() instead:

import re

not_letters = re.compile(r'[^a-zA-Z]')

b = not_letters.split(a[1])

Demo:

>>> import re
>>> not_letters = re.compile(r'[^a-zA-Z]')
>>> a = ['pear', 'apple?orange']
>>> not_letters.split(a[1])
['apple', 'orange']
>>> a = ['pear', 'apple!orange'] 
>>> not_letters.split(a[1])
['apple', 'orange']
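
One caveat with a single-character class: when two separators sit next to each other, or a separator starts or ends the string, `split()` returns empty strings. Here is a minimal sketch of one way around that, assuming empty pieces should simply be dropped. The helper name `split_letters` is hypothetical and not part of the answer above:

```python
import re

# One or more non-letters count as a single separator.
not_letters = re.compile(r'[^a-zA-Z]+')

def split_letters(text):
    """Split text on runs of non-letters, dropping empty pieces."""
    return [part for part in not_letters.split(text) if part]

print(split_letters('apple?!orange'))   # ['apple', 'orange']
print(split_letters('?apple#orange!'))  # ['apple', 'orange']
```

The `+` handles adjacent separators, and the filter removes the empty strings left by a leading or trailing separator.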

edited Dec 02 '13 at 23:05

answered Dec 02 '13 at 22:58

Martijn Pieters

I suppose I would not be popular if I were to suggest that Unicode contains alphabetic characters other than `a-zA-Z`? ;-) – Steve Jessop Dec 02 '13 at 23:23
@SteveJessop And I suppose I would be popular if I were to link to [a previous Stack Overflow answer on building a string of unicode letters](http://stackoverflow.com/a/2127648/1081569) :). – Paulo Almeida Dec 02 '13 at 23:33
@PauloAlmeida: Hmm. I'm wondering (a) whether there are any alphabetic characters outside the BMP, (b) whether it would be better to start with `\w` or `\W` and the UNICODE flag, and then add in / take out digits and underscore. – Steve Jessop Dec 02 '13 at 23:36
Ah, here we go: `re.split('\\W|[\\d_]', 'foo!bar0baz', flags = re.UNICODE)` – Steve Jessop Dec 02 '13 at 23:42
@SteveJessop (a) I had no idea, but apparently yes, according to the documentation: "If UNICODE is set, \W will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database". – Paulo Almeida Dec 02 '13 at 23:42
@PauloAlmeida: my point (a) is just that if there are alphabetic characters with code points greater than 65535, then that code you link to misses them from its list. – Steve Jessop Dec 02 '13 at 23:43
@SteveJessop Yes, I understood. I wouldn't worry too much about it (the person who asked that question ended up using only a subset from a locale, which is reasonable in many cases), but of course your method with \W avoids having to compile that long string altogether. – Paulo Almeida Dec 02 '13 at 23:47
All this discussion and it's obvious no one cares about the pear... ;-) – Jon Clements Dec 02 '13 at 23:49
Is there any way to do it using list comprehensions and string methods? – user3044013 Dec 02 '13 at 23:50
@user3044013: I don't think any of the string methods really help, so you'd have to examine each character in turn. Basically write something similar to `split()` with your own test for whether a character is a separator. – Steve Jessop Dec 03 '13 at 00:01
@user3044013: it'd be slow, and you'd probably have to use `itertools.groupby()` if you really wanted to use a list comprehension to do this. You'd loop over the string and group by whether each character is a letter, then discard every group not consisting of letters and `''.join()` the others. – Martijn Pieters Dec 03 '13 at 08:44
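
The `itertools.groupby()` idea from the last comment could be sketched roughly as follows. This is only an illustration: the function name `split_on_non_letters` is hypothetical, and `str.isalpha` is assumed to be an acceptable test for "letter" (it is Unicode-aware, so it accepts more than `a-zA-Z`):

```python
from itertools import groupby

def split_on_non_letters(text):
    """Group consecutive characters by str.isalpha() and keep the letter runs."""
    return [''.join(group)
            for is_letter, group in groupby(text, key=str.isalpha)
            if is_letter]

a = ['pear', 'apple?orange']
print(split_on_non_letters(a[1]))  # ['apple', 'orange']
```

Because whole runs of non-letters are discarded, adjacent separators do not produce empty strings here.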

score 0 · Answer 2 · answered Dec 02 '13 at 23:55

If you want a Unicode-aware regex to match non-alphabetic characters:

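import re

# In Python 3, re.UNICODE is already the default for str patterns; the flag is kept for clarity.
non_letters = re.compile(r'[\W\d_]', flags=re.UNICODE)
non_letters.split('apple!orange')      # ['apple', 'orange']
non_letters.split('p\xEAche0poire')    # ['pêche', 'poire']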
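
On point (a) from the comments, there are alphabetic characters outside the BMP, and this pattern treats them as letters. Here is a small check, assuming Python 3.3 or later, where a `str` holds full code points. The sample character is an arbitrary choice for illustration:

```python
import re

# str patterns are Unicode-aware by default in Python 3, so no flag is needed.
non_letters = re.compile(r'[\W\d_]')

# U+10428 DESERET SMALL LETTER LONG I has a code point above 65535.
astral_letter = '\U00010428'
word = 'ab' + astral_letter + 'cd'

print(astral_letter.isalpha())                                # True
print(non_letters.split(word + '!poire') == [word, 'poire'])  # True
```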

Steve Jessop
