How can "plaintext" Java source codepoints be programmatically converted to Emoji in Python3?

Question

I've written a Python3 script to extract strings of C/C++/Java source codepoints/surrogate pairs for emoji characters (\ud83d\ude00 for 😀, for example) from a text file.

I also have a dictionary in this script mapping emoji to their descriptions ("😀" => "grinning face"). How can I convert the surrogate pairs (\ud83d\ude00, string literal) to their emoji counterparts in order to use them as keys to access the corresponding emojis' descriptions in the dictionary?

For some additional information, I'm extracting the strings in such a way that when I run print(extracted_string), the console output is \ud83d\ude00. When I attempt to assign the value at the emoji key to a variable, I get back an error:

description = dictionary[extracted_string]
KeyError: '\\ud83d\\ude00'

score 2 · Accepted Answer · answered Feb 21 '18 at 18:31

This is the same as JSON's encoding, too.

>>> import json
>>> json.loads('"\\ud83d\\ude00"')
'😀'

Josh Lee

For anyone else looking for this answer: the string *must* be formatted as above, with quotes around the string literal of the surrogate pairs. So if the variable `emoji` is assigned the string literal value `\ud83d\ude00`, it'd be necessary to set `emoji = '"' + emoji + '"'`. Thank you for the answer, Josh! – sidd flinch Feb 22 '18 at 14:37
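To make the quoting step in the comment above concrete, here is a minimal sketch that wraps the extracted text in double quotes, hands it to `json.loads`, and uses the result as a dictionary key. The names `descriptions` and `unescape_java_literal` are hypothetical stand-ins, not taken from the asker's script, and the sketch assumes the extracted text contains only escapes that JSON also accepts (such as `\uXXXX`) and no unescaped double quotes.

```python
import json

# Hypothetical stand-in for the asker's emoji-to-description dictionary.
descriptions = {"\U0001F600": "grinning face"}


def unescape_java_literal(escaped: str) -> str:
    """Turn text like \\ud83d\\ude00 (literal backslashes) into real characters.

    Assumption: the input holds only JSON-compatible escapes and no bare
    double quotes; anything else makes json.loads raise an error.
    """
    # json.loads needs a complete JSON string, so add the surrounding quotes.
    return json.loads('"' + escaped + '"')


# A raw string: 12 characters, starting with a real backslash, as read from the file.
extracted_string = r"\ud83d\ude00"

emoji = unescape_java_literal(extracted_string)
description = descriptions[emoji]
print(description)
```

Without the added quotes, `json.loads` sees a bare backslash rather than a JSON string and rejects the input, which is why the wrapping step matters.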

score 0 · Answer 2 · answered Feb 21 '18 at 18:22

It took some digging and a whole bunch of encoding/decoding, but I've found something that works:

extracted_string = '\\ud83d\\ude00' #String literal as read from file
emoji = extracted_string.encode().decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
print(emoji)

Output:

😀

This is slightly modified from @falestru's answer here: https://stackoverflow.com/a/26311382/1082235
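For readers wondering why the chain needs three conversions, the sketch below splits the same approach into named steps. The variable names are illustrative only. One deliberate change from the one-liner: it encodes with `"ascii"` instead of the default UTF-8, on the assumption that the extracted text is pure ASCII, because `unicode-escape` reads the bytes as Latin-1 and would mis-decode any non-ASCII characters.

```python
# Raw string: literal backslashes, exactly as read from the file.
extracted_string = r"\ud83d\ude00"

# Step 1: interpret the \uXXXX escapes. The result is two separate
# surrogate code points (U+D83D, U+DE00), not yet a single emoji.
# Encoding as ASCII raises on non-ASCII input instead of corrupting it.
surrogates = extracted_string.encode("ascii").decode("unicode-escape")

# Step 2: serialize to UTF-16. The "surrogatepass" error handler lets
# the lone surrogates through, where the default "strict" would raise.
utf16_bytes = surrogates.encode("utf-16", "surrogatepass")

# Step 3: decode the UTF-16 bytes. The decoder sees a valid high/low
# surrogate pair and combines it into the single code point U+1F600.
emoji = utf16_bytes.decode("utf-16")

print(len(surrogates), len(emoji))
print(emoji)
```

The intermediate string has length 2 and the final string has length 1, which is the difference between a dictionary key that misses and one that matches.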
