Python, Remove characters, such as emoji, that cannot be handled by UTF8 MySQL DB

Question

How can I replace characters, such as emojis , that cannot be handled by a UTF8 MySQL DB?

The key is to ONLY remove those characters that cannot be handled. I got this code from this answer removing emojis from a string in Python, but it's removing too much. (EDIT: This is the page that I got the code below from remove unicode emoji using re in python)

myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+', 
    re.UNICODE)

 my_text= myre.sub(r'EMOJI', my_text)

For example, this heart symbol ♥ can be saved to the DB, but is caught by the above regexp.

In case you're not aware, mysql can handle emojis with utf8mb4. — Alex Hall, Jun 07 '16 at 00:12
Yes, thanks! I definitely want to change it at some point, but can't right now take the site down for the hours it may take to restart the MySQL DB. — user984003, Jun 07 '16 at 00:14
That code isn't on the page you link to: is it you that has converted it to match against 16-bit rather than 32-bit wide characters? — Steve Jessop, Jun 07 '16 at 00:41
Ah! Made an edit. It was on another page http://stackoverflow.com/questions/26568722/remove-unicode-emoji-using-re-in-python — user984003, Jun 07 '16 at 00:56

Steve Jessop · Accepted Answer · 2016-06-07T00:56:39.993

3

MySQL's utf8 encodes precisely the basic multilingual plane (BMP). Rather than specifically emoji, you need to exclude all code points from supplementary planes, since in MySQL these require utf8mb4.

Since you appear to be matching against 16 bit rather than 32 bit wide strings, a code point outside the BMP is encoded as a so-called "high surrogate" in the range 0xD800..0xDBFF, followed by a "low surrogate" in the range 0xDC00..0xDFFF. The corresponding regex therefore is:

u'[\ud800-\udbff][\udc00-\udfff]'.

♥ will not match this since it is u'\u2665'. I think strictly speaking it's only an emoji if followed by the variation selector U+FE0F, but either way it's safely in the BMP.

edited Jun 07 '16 at 00:56

answered Jun 07 '16 at 00:45

Steve Jessop

273,490
39
460
699

This seems to work, thanks! That left the heart and some Chinese characters that I tried. The regexp converted it into EMOJIEMOJIEMOJI. Will it always be 3 times? – user984003 Jun 07 '16 at 00:59
@user984003: I can't reproduce your result, I get `re.sub(u'[\ud800-\udbff][\udc00-\udfff]', 'EMOJI', u'\ud83d\ude00') == 'EMOJI'`, just once. I used `u'\ud83d\ude00'` as the input string because D83D DE00 is the surrogate pair for `U+01F600`. – Steve Jessop Jun 07 '16 at 01:06
You're right! I don't know where the extras came from! You answer works great. Thank you! – user984003 Jun 07 '16 at 01:12
This answer does not work when Python is compiled using UCS-4. Use this answer https://stackoverflow.com/a/13752628/984003. – user984003 Jun 02 '17 at 18:19

Quinchilion · Answer 2 · 2016-06-07T00:52:54.267

0

>>> u"abcd ♥ \ud83c".encode("utf-8", errors="replace").decode("utf-8")
'abcd ♥ ?'

edited Jun 07 '16 at 00:52

answered Jun 07 '16 at 00:32

Quinchilion

912
6
16

This gives an error "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)" – user984003 Jun 07 '16 at 00:46
1

Because @Quinchilion is using Python3. In Python2 that'd be `u"abcd ♥ \ud83c".encode("utf-8", errors="replace").decode('utf8')` – Steve Jessop Jun 07 '16 at 00:49
@SteveJessop You're right, thanks. Edited for compatibility. – Quinchilion Jun 07 '16 at 00:53
Thanks for taking the time! This time there is no error, but the emoji isn't replaced. I went ahead and set the other answer as correct. – user984003 Jun 07 '16 at 01:13

Python, Remove characters, such as emoji, that cannot be handled by UTF8 MySQL DB

2 Answers2