
I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have strings like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘 respectively.

I've read several other SO posts and the Unicode HOWTO, trying to grasp encode and decode, to no avail.

This didn't work:

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'

This ended up working on a one-off basis, but fails the moment I apply it to my RDD.

def rename_if_emoji(pattern):
  """rename the element name of dataframe with emoji"""

  if pattern.lower().startswith("u+"):
    emoji_string = ""
    EMOJI_PREFIX = "u+"
    for part_org in pattern.lower().split(" "):
      part = part_org.strip();
      if (part.startswith(EMOJI_PREFIX)):
        padding = "0" * (8 + len(EMOJI_PREFIX) - len(part)) 
        codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
        print("codepoint: " + codepoint)
        emoji_string += codepoint.decode('unicode-escape')
        print("emoji_string: " + emoji_string)
    return emoji_string
  else:
    return pattern

rename_if_emoji_udf = udf(rename_if_emoji)

Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f618' in position 14: ordinal not in range(128)

Peter
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – nutic May 02 '18 at 15:10
  • In python 2, u'u+1f618' is exactly the same thing (conceptually) as 😘. It sounds like maybe your problem is that something (your terminal maybe?) isn't *rendering* the character the way you want? Can you give more detail on what you want to happen and where, and what is actually happening? – Tom Dalton May 02 '18 at 15:44
  • @TomDalton No, u'\U0001f618' is the same as u'😘'. – Mark Tolonen May 02 '18 at 16:14
  • Why not using a simple `.replace` on the unicode text? – Giacomo Catenazzi May 02 '18 at 16:42

1 Answer

Whether emoji print correctly depends on the IDE or terminal used. On a terminal that cannot represent them you'll get a UnicodeEncodeError, because Python 2's print encodes Unicode strings to the terminal's encoding. You also need font support. Your error is in the print call: you've decoded the string correctly, but your output device needs an encoding that can represent the characters, ideally UTF-8.
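As a rough sketch of where the error comes from (the variable names here are illustrative, not taken from the question): the "position 14" in the traceback matches the length of the "emoji_string: " prefix in the question's second print call. That is consistent with the failure happening when the concatenated string is encoded for output, not while the code point is decoded. The snippet assumes Python 2.7 or Python 3.3+ and avoids printing the emoji itself so it runs on any terminal.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; names are hypothetical, not from the question.
prefix = u'emoji_string: '
emoji = u'\U0001f618'        # an already correctly decoded Unicode string
message = prefix + emoji

# The emoji starts right after the prefix, at index len(prefix).
print(len(prefix))

# Encoding to ASCII (what Python 2's print can end up doing when
# stdout has no richer encoding) fails on the emoji ...
try:
    message.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc.start)         # index of the first unencodable character

# ... while encoding the same string to UTF-8 succeeds.
utf8_bytes = message.encode('utf-8')
print(repr(utf8_bytes))
```

If the string encodes to UTF-8 without error, the data itself is fine and only the output step needs attention.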

The example below simplifies the decoding process. It also prints the repr() of the string in case the terminal isn't configured to support the characters being printed.

import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1),16)
    return '\\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)',replacement,s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t

Output (on a UTF-8 IDE):

u'\U0001f618 \U0001f375'
😘 🍵
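If the same conversion has to run on both Python 2 and Python 3, one possible variant (a sketch, not the code above; replace_codepoints and _to_char are hypothetical names) builds each character by packing the code point as UTF-32 and decoding it. This sidesteps the unichr()/chr() difference and the ur'' literal, which Python 3 does not accept. Unlike the regex above, this sketch also matches an uppercase "U+" prefix.

```python
# -*- coding: utf-8 -*-
import re
import struct

# Hypothetical helper names; the pattern mirrors the u+nnnn format from the
# question, but is case-insensitive (an assumption, not in the original).
_CODEPOINT_RE = re.compile(r'u\+([0-9a-f]+)', re.IGNORECASE)

def _to_char(match):
    """Turn one matched hexadecimal code point into its Unicode character."""
    codepoint = int(match.group(1), 16)
    # Pack as a little-endian 32-bit integer and decode as UTF-32.
    # Values above 0x10FFFF raise an error instead of producing garbage.
    return struct.pack('<I', codepoint).decode('utf-32-le')

def replace_codepoints(text):
    """Replace every u+nnnn token in text with the corresponding character."""
    return _CODEPOINT_RE.sub(_to_char, text)

converted = replace_codepoints(u'u+1f618 u+1f375')
print(repr(converted))       # repr() is safe even on an ASCII-only terminal
```

A function like this has no print calls, so it could presumably be wrapped with udf() the same way as rename_if_emoji in the question without triggering the encoding error on output.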
Mark Tolonen