Convert unicode string representation of emoji to unicode emoji in python

Question

I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have strings like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘 respectively.

I've read several other SO posts and the unicode HOWTO, trying to grasp encode and decode to no avail.

This didn't work:

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'

This ended up working on a one-off basis, but fails the moment I apply it to my RDD.

def rename_if_emoji(pattern):
  """rename the element name of dataframe with emoji"""

  if pattern.lower().startswith("u+"):
    emoji_string = ""
    EMOJI_PREFIX = "u+"
    for part_org in pattern.lower().split(" "):
      part = part_org.strip();
      if (part.startswith(EMOJI_PREFIX)):
        padding = "0" * (8 + len(EMOJI_PREFIX) - len(part)) 
        codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
        print("codepoint: " + codepoint)
        emoji_string += codepoint.decode('unicode-escape')
        print("emoji_string: " + emoji_string)
    return emoji_string
  else:
    return pattern

rename_if_emoji_udf = udf(rename_if_emoji)

Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f618' in position 14: ordinal not in range(128)

Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) — nutic, May 02 '18 at 15:10
In python 2, u'u+1f618' is exactly the same thing (conceptually) as 😘. It sounds like maybe your problem is that something (your terminal maybe?) isn't *rendering* the character the way you want? Can you give more detail on what you want to happen and where, and what is actually happening? — Tom Dalton, May 02 '18 at 15:44

Mark Tolonen · Accepted Answer · 2018-05-02T16:50:20.897

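Whether emoji print correctly depends on the IDE or terminal you use. Python 2's print encodes Unicode strings to the output stream's encoding, so on a stream that can't represent the character you get a UnicodeEncodeError; you also need a font that contains the glyphs. Your error is on the print, not the decode: position 14 in the message is exactly the length of "emoji_string: ", so the failing character is the emoji you appended to that prefix. You've decoded it correctly, but your output device should ideally support UTF-8.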
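To see the failure in isolation, here is a minimal sketch (not part of the original code) that imitates what print does on an ASCII-only stream. It assumes the stream's encoding is ASCII, as the 'ascii' codec in the traceback suggests; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

kiss = u'\U0001f618'  # the character named in the traceback

# Roughly what print does on an ASCII-only stream: encode the text,
# which fails for any character outside the 0-127 range.
try:
    kiss.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding the same string to UTF-8 succeeds and yields four bytes.
utf8_bytes = kiss.encode('utf-8')

print(ascii_ok)         # False
print(len(utf8_bytes))  # 4
```

Setting the PYTHONIOENCODING environment variable to utf-8 before the interpreter starts is one way to change the encoding print uses; whether that setting reaches Spark executors depends on how the cluster is configured.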

The following example simplifies the decoding process. I also print the repr() of the string in case the terminal isn't configured to support the characters being printed.

import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1),16)
    return '\\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)',replacement,s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t

Output (on a UTF-8 IDE):

u'\U0001f618 \U0001f375'
😘 🍵
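For readers on Python 3, where str is already Unicode and both ur'' literals and str.decode are gone, a roughly equivalent sketch can use chr() instead of the unicode-escape round trip. The function and variable names below are hypothetical, and unlike the regex above this pattern also accepts an uppercase "U+" prefix.

```python
import re

# Hypothetical name: matches "u+" (or "U+") followed by hexadecimal digits.
_CODEPOINT = re.compile(r'u\+([0-9a-f]+)', re.IGNORECASE)

def _to_char(match):
    """Convert the captured hexadecimal digits to the character they name.

    chr() raises ValueError for values above 0x10FFFF; this sketch
    does not guard against that.
    """
    return chr(int(match.group(1), 16))

def replace_codepoints(text):
    """Replace every u+nnnn token in text with the corresponding character."""
    return _CODEPOINT.sub(_to_char, text)

decoded = replace_codepoints('u+1f618 u+1f375')
# ascii() gives an escaped form that is safe on any terminal.
print(ascii(decoded))  # '\U0001f618 \U0001f375'
```

Because the function contains no print calls, it avoids the encoding step that raised the error in the question.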
