
I want to get a string that could possibly contain a UCS-2 or UCS-4 emoji code into a MySQL database. The JSON response I get in Python that needs to be sent to MySQL is from the following pseudocode:

import requests

response = requests.post("URL", headers=headers, data=data)
responseDict = response.json()
strings = responseDict["data_with_emojis"]  # data looks like u'key': u'value', ...

Python's native str() function fails on emojis, and I can't seem to figure out how to substitute them out of the raw data.

Any solution that gets these codes stringified will suffice, but ideally I'd like to remove or replace them on the Python side of my system. I don't mind, however, using str_replace() with regex in PHP to remove the stringified emoji codes. The point is, these emojis need to be GONE.

How can I remove them?

(I fear my understanding of Unicode and charsets in general is the root of the issue here.)

Martin Thoma
  • You should show some actual data, in the format you get it, as well as the code you're trying and the error you get; I don't see why you should be calling `str()` at all. – Daniel Roseman Oct 15 '15 at 08:00
  • If your only reason for stripping out these characters is that MySQL can't cope with them, note that it's perfectly happy to store non-ASCII characters, so maybe you don't need to strip them at all. – Daniel Roseman Oct 15 '15 at 08:01

3 Answers


If you simply want to remove the Unicode emoticons, you can do so with Python:

>>> yourUnicodeString = u'I like answering questions on SO ☺'
>>> print(yourUnicodeString)
>>> print(yourUnicodeString.replace(u'☺', u':-)'))
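
The single replace() call above generalizes to several emoticons at once. The following is a minimal sketch, assuming a small lookup table; the name `EMOJI_TO_TEXT`, its entries, and the helper `replace_emoji` are hypothetical and only illustrate the idea.

```python
# Hypothetical lookup table: each emoji maps to an ASCII stand-in.
EMOJI_TO_TEXT = {
    u"\u263a": u":-)",      # WHITE SMILING FACE, the character used above
    u"\U0001F600": u":D",   # GRINNING FACE, an emoji outside the BMP
}


def replace_emoji(text, mapping=EMOJI_TO_TEXT):
    """Replace every emoji listed in `mapping` with its ASCII stand-in."""
    for emoji, replacement in mapping.items():
        text = text.replace(emoji, replacement)
    return text


print(replace_emoji(u"I like answering questions on SO \u263a"))
```

Characters that are not in the table pass through unchanged, so this only helps when the set of emoticons to replace is known in advance.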

You might also be interested in

Martin Thoma

The problem is not specific to emoji; it affects every Unicode character with a code point above 127, so you would hit the same error with, for example, the letter Ä. What you need to figure out is how to output the Unicode correctly. You already have correct Unicode strings (u'key': u'value'), so just don't call str() on them.

A small example of how not to do it (str()), followed by an explicit UTF-8 encode that works:

>>> x=u'Ä'
>>> str(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 0: ordinal not in range(128)
>>> x.encode('utf8')
'\xc3\x84'

If your question is specifically about Emojis then I will change my answer.
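
As a minimal sketch of the same idea in a form that runs unchanged on Python 3, the helper below (the name `to_utf8_bytes` is hypothetical) encodes explicitly instead of calling str(). One related point, offered as an assumption about your setup: if the emoji are to be stored rather than stripped, MySQL's `utf8` character set holds at most three bytes per character, so four-byte characters such as most emoji need `utf8mb4`.

```python
def to_utf8_bytes(text):
    """Encode a Unicode string to UTF-8 bytes explicitly, instead of str()."""
    return text.encode("utf-8")


sample = u"\xc4 and \U0001F600"  # the letter from the example above plus an emoji
encoded = to_utf8_bytes(sample)

# The emoji takes four bytes in UTF-8, the letter takes two.
print(len(to_utf8_bytes(u"\U0001F600")), len(to_utf8_bytes(u"\xc4")))

# Decoding with the same codec gives back the original string.
assert encoded.decode("utf-8") == sample
```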

Andrey

Thanks for your help; the advice pointed me in the right direction. Here is the solution that worked for me. It replaces all emojis, along with every other non-ASCII character, with empty strings ('').

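import re

import MySQLdb

emoji_infected_text = "String with UCS-2 and/or UCS-4 codes"

def remove_non_ascii_1(text):
    # Drop every character whose code point is 128 or above.
    return ''.join([i if ord(i) < 128 else '' for i in text])

def remove_non_ascii_2(text):
    # Same effect as remove_non_ascii_1, using a regular expression.
    return re.sub(r'[^\x00-\x7F]+', '', text)

def remove_non_ascii_3(text):
    # Drop characters outside the Basic Multilingual Plane, such as most emoji.
    return re.sub(u'[\U00010000-\U0010ffff]+', '', text)

# Escape quotes and backslashes before building the INSERT statement.
emoji_free_text = MySQLdb.escape_string(remove_non_ascii_3(remove_non_ascii_2(remove_non_ascii_1(emoji_infected_text))))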

Obviously you can consolidate this code quite a bit (the first function alone already removes every non-ASCII character, so the other two are redundant when chained after it), but I didn't want there to be any confusion for anyone suffering from the same problem in the future. MySQLdb.escape_string() isn't related to the emoji-removing task, but it's good for making sure your program doesn't fail when inserting troublesome characters like quotes or backslashes.
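
For anyone who wants the consolidated version, here is a minimal sketch that collapses the three functions into one; the names `NON_ASCII` and `strip_non_ascii` are hypothetical. Like the chain above, it removes every non-ASCII character, not only emoji, so accented letters are lost as well.

```python
import re

# Matches runs of any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text):
    """Return `text` with every non-ASCII character removed."""
    return NON_ASCII.sub(u"", text)


print(strip_non_ascii(u"Great job \U0001F600!"))
```

The escaping step is left out here on purpose, since it is independent of the character stripping.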