Remove characters outside of the BMP (emojis) in Python 3

Question

I have an error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk

I'm parsing data, and some emojis end up in my array. For example, I have data = "this variable contains some emoji'sツ" and I want data = "this variable contains some emoji's".

How can I remove these characters from my data, or otherwise handle this situation, in Python 3?

How about some relevant pieces of code? And is this related to `Tkinter`? – Iron Fist, Mar 29 '16 at 12:07
I'm parsing data, and some emojis end up in my array. For example, I have data = "this variable contains some emoji'sツ" and I want data = "this variable contains some emoji's". – CarloDiPalma, Mar 29 '16 at 12:18
ツ is not an emoji, and it is inside the BMP. Do you want to remove that too? – deceze, Mar 29 '16 at 12:55

score 11 · Accepted Answer · answered Mar 29 '16 at 13:03

If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:

data = "this variable contains some emoji'sツ"
data = ''.join(c for c in data if c <= '\uFFFF')

It's possible your string is in decomposed form, so you may need to normalize it to composed form first so the non-BMP characters are identifiable:

import unicodedata

data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
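As a minimal sketch, the two steps above can be wrapped in a helper. The name `strip_non_bmp` and the sample string are illustrative assumptions, not part of the original answer. The sample shows that a real emoji (U+1F600) is dropped, while ツ (U+30C4), which lies inside the BMP, is kept:

```python
import unicodedata

BMP_MAX = '\uFFFF'  # highest code point in the Basic Multilingual Plane


def strip_non_bmp(text):
    """Return text without characters above U+FFFF (hypothetical helper)."""
    # Normalize to composed form first, as the answer suggests.
    composed = unicodedata.normalize('NFC', text)
    # Keep only characters whose code point is at most U+FFFF.
    return ''.join(c for c in composed if c <= BMP_MAX)


sample = "emoji's \U0001F600 and ツ"
cleaned = strip_non_bmp(sample)
print(cleaned)
```

Note that the space on each side of the removed emoji remains, so the result contains two consecutive spaces.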

answered Mar 29 '16 at 13:03

ShadowRanger

For whatever reason this didn't work for me with Python 2.7.5. (I tried both solutions.) The lowercase characters were stripped out. Instead I used `re_pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE); data = re_pattern.sub('', data)` – Eric Klien Oct 06 '16 at 09:18
@EricKlien: This answer was for Python 3; if you tried to use it in Python 2 without a `u` prefix, it would make `str`, not `unicode`, and the comparisons wouldn't work. Your inputs would need to be made `unicode`, and the literal being tested against would need the `u` prefix, making it `u'\uFFFF'`. If you do that, [it works just fine](https://tio.run/##PY4xDoJAEEV7TjHRYsHAFlbGSOspbFZYZAjMkN1BQqsews7DeQFvgICJ0838@f@/dpCSaTuOa0g2CWScI1320EmR7OZLgE3LTgA59oMPciMGUuhWUqKHq3FozrWdbCQGyYPnxoJtuELl3/fH5/W8rf4mpXTFSGEGBTvIAAkWCYtpOcwPp@44jYoCZM2tpXDq1F5y7kQXWFviMIpB9SoGSz/WVC2sKtK9Q7HhnBiN4xc). – ShadowRanger Aug 16 '19 at 21:27
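The regex idea from the comment above should also carry over to Python 3, where `str` is always Unicode and no `u` prefix is needed. The following is a sketch under that assumption; the helper name `replace_non_bmp` and the optional replacement character are illustrative, not something from the thread. The character class here matches U+10000 through U+10FFFF directly, which selects the same characters as negating the range U+0000 to U+FFFF:

```python
import re

# Matches any single code point outside the BMP (U+10000 to U+10FFFF).
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_non_bmp(text, replacement=''):
    """Replace each non-BMP character in text (hypothetical helper)."""
    return NON_BMP.sub(replacement, text)


removed = replace_non_bmp("ok \U0001F600")
marked = replace_non_bmp("ok \U0001F600", '\uFFFD')
print(repr(removed))
print(repr(marked))
```

Substituting U+FFFD (the replacement character) rather than an empty string keeps a visible marker where something was dropped, which may be preferable when the text is shown in a Tk widget.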

score -2 · Answer 2 · edited May 23 '17 at 10:27

>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, data)
"this variable contains some emoji's"

For BMP read this: removing emojis from a string in Python

edited May 23 '17 at 10:27

Community

answered Mar 29 '16 at 13:12

Maciej A. Czyzewski

That would eliminate anything outside the printable ASCII range, not just stuff outside the BMP. Also, on Python 3, `filter` always returns a lazy iterator (a `filter` object), not a `str`/`tuple`/`list` matching the argument type. Lastly, don't bother with `filter`/`map` if you need a `lambda` to do it; the genexpr/listcomp will be faster and more succinct: `''.join(x for x in data if x in printable)` (though in this case, `''.join(filter(printable.__contains__, data))` would get the speed the `lambda` version doesn't). – ShadowRanger Mar 29 '16 at 13:13
Yeah... I was about to comment on that as well: it only keeps the printable ASCII range. – Iron Fist Mar 29 '16 at 13:14
@ShadowRanger Oops... I thought he wanted to remove all non-ASCII characters, not only the non-BMP ones... sorry – Maciej A. Czyzewski Mar 29 '16 at 13:16
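To make the points in these comments concrete, here is a small Python 3 sketch using the question's sample string. It shows that `filter` returns a `filter` object rather than a `str`, and that both join-based forms strip ツ as well, since it is not printable ASCII:

```python
import string

printable = set(string.printable)
data = "this variable contains some emoji'sツ"

# On Python 3 this is a lazy filter object, not a str.
lazy = filter(lambda x: x in printable, data)
print(type(lazy).__name__)

# Generator expression joined back into a str.
via_genexpr = ''.join(x for x in data if x in printable)

# filter with a bound method instead of a lambda, also joined into a str.
via_filter = ''.join(filter(printable.__contains__, data))

print(via_genexpr)
print(via_genexpr == via_filter)
```

Both results drop ツ even though it is inside the BMP, so this approach only fits when ASCII-only output is acceptable.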
