Python - remove elements (foreign characters) from list

Question

I have a Python list that contains foreign characters, written as Unicode escapes:

python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'\u30ed\u30fc\u30de\u5b57\uff08\u30ed\u30fc\u30de\u3058\uff09\u3068\u306f\u3001\u4eee\u540d\u6587\u5b57\u3092\u30e9\u30c6\u30f3\u6587\u5b57\u306b\u8ee2\u5199\u3059\u308b\u969b\u306e\u898f\u5247\u5168\u822c\uff08\u30ed\u30fc\u30de\u5b57\u8868\u8a18\u6cd5\uff09\u3001\u307e\u305f\u306f\u30e9\u30c6\u30f3\u6587\u5b57\u3067\u8868\u8a18\u3055\u308c\u305f\u65e5\u672c\u8a9e\uff08\u30ed\u30fc\u30de\u5b57\u3064\u3065\u308a\u306e\u65e5\u672c\u8a9e\uff09\u3092\u8868\u3059\u3002']

I need to remove all the items like '\u7e2e' and other similar ones. If an item in the list contains even one ASCII letter or word, it shouldn't be excluded; for example, 'China\u3062' should be kept. I referred to this question and realized it has something to do with values greater than 128. I tried different approaches, like this one:

new_list = [item for item in python_list if ord(item) < 128]

but this returns an error:

TypeError: ord() expected a character, but string of length 2 found

Expected Output:

new_list = ['to', 'shrink','chijimu', 'tizimu', 'tidimu', 'to', 'continue','tsuzuku', 'tuzuku', 'tuduku']

How should I go about this?

You need the `is_ascii` function, see [here](http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii) — georg, Oct 23 '14 at 06:57
TypeError: ord() expected a character, but string of length 2 found — Hypothetical Ninja, Oct 23 '14 at 07:11

score 3 · Accepted Answer · answered Oct 23 '14 at 12:19

If you wish to keep all words that have at least one ASCII letter in them, the code below will do this:

from string import ascii_letters

python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 
               'chijimu','china,', 'tizimu', 'tidimu', 'to', 'continue', 
               u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'china\u3061']

allowed = set(ascii_letters)

output = [word for word in python_list if any(letter in allowed for letter in word)]
print(output)
# ['to',
#  'shrink',
#  'chijimu',
#  'china,',
#  'tizimu',
#  'tidimu',
#  'to',
#  'continue',
#  'tsuzuku',
#  'tuzuku',
#  'tuduku',
#  u'china\u3061']

This iterates over the letters of each word; if at least one letter is contained in `allowed`, the word is added to the output list.
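As a minimal sketch of the same idea on Python 3 (where the `u` prefix is optional), the check can be pulled into a small helper. The name `has_ascii_letter` and the sample words are assumptions made for illustration, not part of the original answer.

```python
from string import ascii_letters

ALLOWED = set(ascii_letters)  # a-z and A-Z only; no digits or punctuation


def has_ascii_letter(word):
    """Return True if at least one character of word is an ASCII letter."""
    return any(ch in ALLOWED for ch in word)


# Hypothetical sample data: plain, purely foreign, mixed, and digits only.
words = ['to', '\u7e2e\u3080', 'china,', 'china\u3061', '\u3061\u3062\u3080', '123']
kept = [w for w in words if has_ascii_letter(w)]
print(kept)
```

Note that '123' is dropped here because digits are not in `ascii_letters`; if digits should count, they would have to be added to the allowed set.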

score 2 · Answer 2 · answered Oct 23 '14 at 06:53

You can approach it like this, since you want to keep the `str` items and remove the `unicode` ones:

new_list = [item for item in python_list if isinstance(item, str)]
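One caveat, sketched below under the assumption that the code runs on Python 3: there every text literal is a `str`, so the type check keeps everything. A content-based check such as `str.isascii()` (available from Python 3.7) is one possible substitute, though it also drops mixed items like 'china\u3061' that the question wants to keep. The sample list is made up for illustration.

```python
# Python 3: every text literal is a str, so the type check no longer filters.
python_list = ['to', 'shrink', '\u7e2e\u3080', 'chijimu', 'china\u3061']

by_type = [item for item in python_list if isinstance(item, str)]

# Content-based alternative: keep items made up only of ASCII characters.
by_content = [item for item in python_list if item.isascii()]

print(by_type == python_list)  # nothing was removed by the type check
print(by_content)
```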

answered Oct 23 '14 at 06:53 by salmanwahed

score 1 · Answer 3 · answered Oct 23 '14 at 06:59

Here's one way:

import string
python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'\u30ed\u30fc\u30de\u5b57\uff08\u30ed\u30fc\u30de\u3058\uff09\u3068\u306f\u3001\u4eee\u540d\u6587\u5b57\u3092\u30e9\u30c6\u30f3\u6587\u5b57\u306b\u8ee2\u5199\u3059\u308b\u969b\u306e\u898f\u5247\u5168\u822c\uff08\u30ed\u30fc\u30de\u5b57\u8868\u8a18\u6cd5\uff09\u3001\u307e\u305f\u306f\u30e9\u30c6\u30f3\u6587\u5b57\u3067\u8868\u8a18\u3055\u308c\u305f\u65e5\u672c\u8a9e\uff08\u30ed\u30fc\u30de\u5b57\u3064\u3065\u308a\u306e\u65e5\u672c\u8a9e\uff09\u3092\u8868\u3059\u3002']
filtered = [s for s in python_list if all(c in string.ascii_letters for c in s)]
print(filtered)

Output:

['to', 'shrink', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', 'tsuzuku', 'tuzuku', 'tuduku']

answered Oct 23 '14 at 06:59 by Mark Tolonen

This approach is a bit slower than the answer by salman. Could you give an example of a case where your answer works and the first one doesn't? Or are they equivalent? – Hypothetical Ninja Oct 23 '14 at 07:04
It's a bit odd that your ASCII strings are byte strings and your foreign strings are Unicode. I didn't notice that at first, but why would there be a mix? Normally a Unicode-aware program will convert all text to Unicode for processing. So this answer would work where all the strings are Unicode and the other would not. – Mark Tolonen Oct 23 '14 at 07:08
Thanks, this works as expected, even on items like u'there', whereas the first answer removes all items that have the u prefix. – Hypothetical Ninja Oct 23 '14 at 09:31
But is it possible to remove only those items that are entirely foreign? For example, with your answer, a list item like u'China,' got removed because of the comma. Can this be avoided? – Hypothetical Ninja Oct 23 '14 at 09:35
@Swordy This is the more appropriate solution for your case. Here `string.ascii_letters` is nothing but a string, so you can add special characters to it as needed, e.g. `string.ascii_letters + ','`. – salmanwahed Oct 23 '14 at 10:25
@Swordy, you could also approve all ASCII characters by using `all(ord(c) < 128 for c in s)`. – Mark Tolonen Oct 23 '14 at 17:21
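The three variants discussed in these comments can be compared side by side. This is only a sketch on Python 3 with made-up sample words; the variable names are assumptions, and the predicates are the ones quoted above.

```python
import string

# Hypothetical sample words: plain, with a comma, with a space and digits,
# purely foreign, and plain again.
words = ['there', 'China,', 'route 66', '\u7d9a\u304f', 'tuzuku']

letters = set(string.ascii_letters)
letters_and_comma = set(string.ascii_letters + ',')

# Every character must be an ASCII letter.
only_letters = [w for w in words if all(c in letters for c in w)]
# Every character must be an ASCII letter or a comma.
with_comma = [w for w in words if all(c in letters_and_comma for c in w)]
# Every character must be any ASCII character (code point below 128).
all_ascii = [w for w in words if all(ord(c) < 128 for c in w)]

print(only_letters)
print(with_comma)
print(all_ascii)
```

All three use `all(...)`, so an item with a single foreign character is still dropped; keeping mixed items requires `any(...)` instead.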

score 1 · Answer 4 · answered Oct 23 '14 at 07:17

Yet another way:

new_list = []
for word in python_list:
    # Drop every non-ASCII byte; keep the word if anything is left over.
    if word.encode('utf-8').decode('ascii', 'ignore') != '':
        new_list.append(word)
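To see what this test actually keeps, here is a small sketch on Python 3. The helper name `ascii_remainder` and the sample words are assumptions for illustration. Because any surviving ASCII character counts, a foreign word followed by a comma or a space is also kept.

```python
def ascii_remainder(word):
    """Encode to UTF-8, then decode as ASCII while ignoring undecodable bytes."""
    return word.encode('utf-8').decode('ascii', 'ignore')


# Hypothetical sample words: plain, purely foreign, mixed,
# foreign plus comma, foreign plus space.
words = ['to', '\u7e2e\u3080', 'china\u3061', '\u3061,', '\u3061 ']

remainders = [ascii_remainder(w) for w in words]
kept = [w for w in words if ascii_remainder(w) != '']

print(remainders)
print(kept)
```

If punctuation and whitespace should not count, the remainder could be checked for letters instead of being compared with the empty string.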

answered Oct 23 '14 at 07:17 by Irshad Bhat
