4

I'm getting this error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk

I'm parsing some data, and some emoji end up in my strings. I have:

data = "this variable contains some emoji'sツ"

I want:

data = "this variable contains some emoji's"

How can I remove these characters from my data, or otherwise handle this situation, in Python 3?

CarloDiPalma

2 Answers

11

If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:

data = "this variable contains some emoji'sツ"
data = ''.join(c for c in data if c <= '\uFFFF')

It's possible your string is in decomposed form, so you may need to normalize it to composed form first so the non-BMP characters are identifiable:

import unicodedata

data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
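The two snippets above can be wrapped into one small helper. This is only a sketch: the function name `strip_non_bmp` and its `replacement` parameter are made up for illustration, and the sample string uses U+1F600 as an assumed example of a non-BMP character.

```python
import unicodedata

def strip_non_bmp(text, replacement=''):
    """Drop (or replace) every code point above U+FFFF.

    `strip_non_bmp` and `replacement` are hypothetical names used
    for illustration only; they are not part of any library.
    """
    # Compose first, as in the answer, then filter by code point.
    text = unicodedata.normalize('NFC', text)
    return ''.join(c if c <= '\uFFFF' else replacement for c in text)

# U+1F600 is outside the BMP; U+30C4 (ツ) is inside it and is kept.
sample = "this variable contains some emoji's\U0001F600"
cleaned = strip_non_bmp(sample)
marked = strip_non_bmp("a\U0001F600b", replacement='\uFFFD')
print(cleaned)
```

Passing a `replacement` such as `'\uFFFD'` keeps a visible marker where a character was dropped, which may be preferable to silent removal when the text is shown to a user.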
ShadowRanger
  • For whatever reason this didn't work for me with Python 2.7.5. (I tried both solutions.) The lowercase characters were stripped out. Instead I used `re_pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE); data = re_pattern.sub('', data)` – Eric Klien Oct 06 '16 at 09:18
  • @EricKlien: This answer was for Python 3; if you tried to use it in Python 2 without a `u` prefix, it would make `str`, not `unicode`, and the comparisons wouldn't work. Your inputs would need to be made `unicode`, and the literal being tested against would need the `u` prefix, making it `u'\uFFFF'`. If you do that, [it works just fine](https://tio.run/##PY4xDoJAEEV7TjHRYsHAFlbGSOspbFZYZAjMkN1BQqsews7DeQFvgICJ0838@f@/dpCSaTuOa0g2CWScI1320EmR7OZLgE3LTgA59oMPciMGUuhWUqKHq3FozrWdbCQGyYPnxoJtuELl3/fH5/W8rf4mpXTFSGEGBTvIAAkWCYtpOcwPp@44jYoCZM2tpXDq1F5y7kQXWFviMIpB9SoGSz/WVC2sKtK9Q7HhnBiN4xc). – ShadowRanger Aug 16 '19 at 21:27
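For comparison, the regex approach from the comment above should also carry over to Python 3, where string literals are already Unicode and no `u` prefix is needed. A minimal sketch, with `NON_BMP` as an illustrative name and U+1F600 as an assumed sample character:

```python
import re

# Matches any single code point outside U+0000..U+FFFF.
# NON_BMP is an illustrative name, not something from the thread.
NON_BMP = re.compile(r'[^\u0000-\uFFFF]')

data = "this variable contains some emoji's\U0001F600"
data = NON_BMP.sub('', data)
print(data)
```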
-2
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, data)
"this variable contains some emoji's"

For handling characters outside the BMP specifically, see: removing emojis from a string in Python

Community
Maciej A. Czyzewski
  • That would eliminate anything outside the printable ASCII range, not just stuff outside the BMP. Also, on Python 3, `filter` returns a generator always, not a `str`/`tuple`/`list` by argument type. Lastly, don't bother with `filter`/`map` if you need a `lambda` to do it; the genexpr/listcomp will be faster and more succinct. `''.join(x for x in data if x in printable)` (though in this case, `''.join(filter(printable.__contains__, data))` would get the speed the `lambda` version doesn't). – ShadowRanger Mar 29 '16 at 13:13
  • Yea...I was to comment about it as well...only printable ASCII range – Iron Fist Mar 29 '16 at 13:14
  • @ShadowRanger Ooops... I thought that he wants to remove all non-ascii, not only BMP... sorry – Maciej A. Czyzewski Mar 29 '16 at 13:16
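The points raised in the comments can be checked with a short Python 3 sketch. The sample string is an assumption chosen for illustration: it contains one non-BMP character (U+1F600) and one BMP character outside ASCII (é), to show that the `string.printable` filter drops both.

```python
import string

printable = set(string.printable)
data = "this variable contains some emoji's\U0001F600 and caf\u00e9"

# On Python 3, filter() returns a lazy filter object, not a str.
filtered = filter(lambda x: x in printable, data)
is_str = isinstance(filtered, str)

# The two forms suggested in the comment; both produce a str.
ascii_only = ''.join(x for x in data if x in printable)
ascii_only_filter = ''.join(filter(printable.__contains__, data))

print(is_str)
print(ascii_only)
```

Note that the é is removed along with the emoji, so this approach is only suitable when restricting the text to printable ASCII is acceptable.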