a = ['pear', 'apple?orange']

or

a = ['pear', 'apple!orange'] 

The question mark or exclamation mark could be any non-alphabetic, non-numeric character (<, ?, #).

I want to remove the non-alphabetic characters and build the following list:

b = ['apple', 'orange']

How do I do it? Do I use a.remove or a.split?

I'm using python 3.

2 Answers


Use re.split() instead:

import re

not_letters = re.compile(r'[^a-zA-Z]')

b = not_letters.split(a[1])

Demo:

>>> import re
>>> not_letters = re.compile(r'[^a-zA-Z]')
>>> a = ['pear', 'apple?orange']
>>> not_letters.split(a[1])
['apple', 'orange']
>>> a = ['pear', 'apple!orange'] 
>>> not_letters.split(a[1])
['apple', 'orange']
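
One edge case is worth noting, as a sketch rather than part of the answer above: when two separators are adjacent, or the string starts or ends with one, `split()` returns empty strings. The helper name `split_letters` below is hypothetical, and filtering out empty pieces is just one way to handle this.

```python
import re

not_letters = re.compile(r'[^a-zA-Z]')

def split_letters(text):
    # Split on every non-letter, then drop the empty strings that appear
    # between adjacent separators or at the ends of the string.
    return [piece for piece in not_letters.split(text) if piece]

print(not_letters.split('apple?!orange'))  # ['apple', '', 'orange']
print(split_letters('apple?!orange'))      # ['apple', 'orange']
```
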
Martijn Pieters
  • I suppose I would not be popular if I were to suggest that Unicode contains alphabetic characters other than `a-zA-Z`? ;-) – Steve Jessop Dec 02 '13 at 23:23
  • @SteveJessop And I suppose I would be popular if I were to link to [a previous Stack Overflow answer on building a string of unicode letters](http://stackoverflow.com/a/2127648/1081569) :). – Paulo Almeida Dec 02 '13 at 23:33
  • @PauloAlmeida: Hmm. I'm wondering (a) whether there are any alphabetic characters outside the BMP, (b) whether it would be better to start with `\w` or `\W` and the UNICODE flag, and then add in / take out digits and underscore. – Steve Jessop Dec 02 '13 at 23:36
  • Ah, here we go: `re.split('\\W|[\\d_]', 'foo!bar0baz', flags = re.UNICODE)` – Steve Jessop Dec 02 '13 at 23:42
  • @SteveJessop (a) I had no idea, but apparently yes, according to the documentation: "If UNICODE is set, \W will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database". – Paulo Almeida Dec 02 '13 at 23:42
  • @PauloAlmeida: my point (a) is just that if there are alphabetic characters with code points greater than 65535, then that code you link to misses them from its list. – Steve Jessop Dec 02 '13 at 23:43
  • @SteveJessop Yes, I understood. I wouldn't worry too much about it (the person who asked that question ended up using only a subset from a locale, which is reasonable in many cases), but of course your method with \W avoids having to compile that long string altogether. – Paulo Almeida Dec 02 '13 at 23:47
  • All this discussion and it's obvious no one cares about the pear... ;-) – Jon Clements Dec 02 '13 at 23:49
  • Is there any way to do it using list comprehensions and string methods? – user3044013 Dec 02 '13 at 23:50
  • @user3044013: I don't think any of the string methods really help, so you'd have to examine each character in turn. Basically write something similar to `split()` with your own test for whether a character is a separator. – Steve Jessop Dec 03 '13 at 00:01
  • @user3044013: it'd be slow, and you'd probably have to use `itertools.groupby()` if you really wanted to use a list comprehension to do this. You'd loop over the string and group by whether each character is a letter, then discard every group not consisting of letters and `''.join()` the others. – Martijn Pieters Dec 03 '13 at 08:44
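
The `itertools.groupby()` idea from the last comment could look roughly like the following. This is a minimal sketch, not code from the thread: the function name `split_on_non_letters` is hypothetical, and it assumes `str.isalpha` is an acceptable test for "is a letter" (it is Unicode-aware, so accented letters count as letters).

```python
from itertools import groupby

def split_on_non_letters(text):
    # groupby() yields consecutive runs of characters that share the same
    # str.isalpha() result; keep only the runs of letters and join each
    # run back into a string.
    return [
        ''.join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]

print(split_on_non_letters('apple?orange'))  # ['apple', 'orange']
print(split_on_non_letters('apple!orange'))  # ['apple', 'orange']
```

Unlike `re.split()`, this approach never produces empty strings, because runs of separators are discarded as whole groups.
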

If you want a Unicode-aware regex to match non-alphabetic characters:

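import re

# In Python 3, str patterns are Unicode-aware by default; re.UNICODE is kept only to be explicit.
non_letters = re.compile(r'[\W\d_]', flags=re.UNICODE)
non_letters.split('apple!orange')    # ['apple', 'orange']
non_letters.split('p\xEAche0poire')  # ['pêche', 'poire']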
Steve Jessop