I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have a string like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘, respectively.
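In plain Python 2 (no Spark), this is roughly the conversion I'm after; cp here is just a stand-in name for one value from my emoji column:

# Plain Python 2, no Spark: the conversion I want for a single value.
cp = u'u+1f618'
escape = '\\U%08x' % int(cp[len(u'u+'):], 16)   # -> '\U0001f618' (literal backslash, 8 hex digits)
emoji = escape.decode('unicode-escape')         # -> u'\U0001f618'
print(repr(emoji))                              # repr() sidesteps any terminal-encoding issues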
I've read several other SO posts and the Unicode HOWTO, trying to grasp encode and decode, to no avail.
This didn't work:
from pyspark.sql.functions import udf

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'
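My suspicion is that 'unicode-escape' only rewrites actual backslash escape sequences, so a plain 'u+1f618' passes through untouched. Outside Spark I'd expect something like:

# Plain Python 2: 'unicode-escape' only understands backslash escapes,
# so my 'u+1f618' strings come back unchanged.
print(repr(u'u+1f618'.decode('unicode-escape')))     # u'u+1f618' (no change)
print(repr('\\U0001f618'.decode('unicode-escape')))  # u'\U0001f618' (the emoji)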
This ended up working on a one-off basis, but fails the moment I apply it to my RDD.
def rename_if_emoji(pattern):
    """Rename a dataframe element: turn 'u+XXXXX' codepoints into the emoji itself."""
    if pattern.lower().startswith("u+"):
        emoji_string = ""
        EMOJI_PREFIX = "u+"
        for part_org in pattern.lower().split(" "):
            part = part_org.strip()
            if part.startswith(EMOJI_PREFIX):
                padding = "0" * (8 + len(EMOJI_PREFIX) - len(part))
                codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
                print("codepoint: " + codepoint)
                emoji_string += codepoint.decode('unicode-escape')
                print("emoji_string: " + emoji_string)
        return emoji_string
    else:
        return pattern

rename_if_emoji_udf = udf(rename_if_emoji)
Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f618' in position 14: ordinal not in range(128)
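If I'm reading the traceback right, position 14 is exactly the length of the "emoji_string: " prefix, so I suspect the print inside the UDF is what blows up: the executor's stdout isn't UTF-8, so printing the unicode result forces an implicit ASCII encode. My guess at a minimal reproduction outside Spark:

# Plain Python 2, no Spark -- my guess at what fails inside the executor.
emoji_string = u'\U0001f618'
try:
    ("emoji_string: " + emoji_string).encode('ascii')  # what print does when stdout is ASCII
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode character u'\U0001f618' in position 14: ...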