
I am using Python to extract comments and display them. When I print them, they look like this:

This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f

How do I convert the Unicode code points of the emojis into their respective CLDR short names? For example, U+1F44D should print as "thumbs up".

Anjali

1 Answer


EDIT: I think I found a solution for the problem with surrogate codes such as \ud83d\udc9c:

text = text.encode('utf-16', 'surrogatepass').decode('utf-16')

It converts the surrogate pair \ud83d\udc9c to the correct emoji code point \U0001f49c.

Source: How to work with surrogate pairs in Python?

Wikipedia: Surrogate

Other: Unicode character inspector
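
A minimal sketch of how this fix can be combined with unicodedata.name(). The helper name fix_surrogates is only an illustration and not part of any library. It assumes the surrogates always come in valid high/low pairs; an unpaired surrogate would still make the decode step fail.

```python
import unicodedata


def fix_surrogates(text):
    """Merge UTF-16 surrogate pairs into real code points (hypothetical helper)."""
    # 'surrogatepass' lets the encoder write surrogates as UTF-16 code units;
    # decoding the bytes again joins each high/low pair into one character.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


fixed = fix_surrogates('Amazing compassion \ud83d\udc9c')
print(unicodedata.name(fixed[-1]))  # PURPLE HEART
```

After the conversion the purple heart is a single character, so functions that work per character can name it.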


Searching with Google, I found:

print('\U0001F44D'.encode('ascii', 'namereplace').decode())

Result:

\N{THUMBS UP SIGN}

And

import unicodedata

print(unicodedata.name('\U0001F44D'))

Result:

THUMBS UP SIGN

So it is worth searching with Google before asking on Stack Overflow.

https://docs.python.org/3/howto/unicode.html


The same works for a longer text:

text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''

print(text.encode('ascii', 'namereplace').decode())

Result:

This was heart wrenching \N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}

Now you may have to remove \N{ and } from the result.

But it has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c: the surrogates have no names, so they stay as escape codes.
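
Removing \N{ and } can be done with a regular expression. This is only a sketch; the :name: output format and the lower-casing are arbitrary choices made for the example.

```python
import re

escaped = 'Great \U0001F44D'.encode('ascii', 'namereplace').decode()
# escaped is now the ASCII string: Great \N{THUMBS UP SIGN}

# Replace every \N{NAME} with :name: (the format is an arbitrary choice).
readable = re.sub(
    r'\\N\{([^}]+)\}',
    lambda match: ':{}:'.format(match.group(1).lower()),
    escaped,
)
print(readable)  # Great :thumbs up sign:
```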


You can use unicodedata in a for loop to get the name of every character in the text, but unicodedata.name() raises ValueError for characters that have no name, e.g. '\n'. It also gives names for ordinary characters, so you may have to use unicodedata.category() to decide which characters to replace.

This also has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c.

import unicodedata

# http://www.unicode.org/reports/tr44/#General_Category_Values

for char in text:
    try:
        print(char, '|', unicodedata.category(char), '|', unicodedata.name(char))
    except ValueError:
        print(repr(char), '| (repr)')

Result:

T | Lu | LATIN CAPITAL LETTER T
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
s | Ll | LATIN SMALL LETTER S
  | Zs | SPACE
w | Ll | LATIN SMALL LETTER W
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
  | Zs | SPACE
h | Ll | LATIN SMALL LETTER H
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
t | Ll | LATIN SMALL LETTER T
  | Zs | SPACE
w | Ll | LATIN SMALL LETTER W
r | Ll | LATIN SMALL LETTER R
e | Ll | LATIN SMALL LETTER E
n | Ll | LATIN SMALL LETTER N
c | Ll | LATIN SMALL LETTER C
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
  | Zs | SPACE
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
'\n' | (repr)
A | Lu | LATIN CAPITAL LETTER A
m | Ll | LATIN SMALL LETTER M
a | Ll | LATIN SMALL LETTER A
z | Ll | LATIN SMALL LETTER Z
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
  | Zs | SPACE
c | Ll | LATIN SMALL LETTER C
o | Ll | LATIN SMALL LETTER O
m | Ll | LATIN SMALL LETTER M
p | Ll | LATIN SMALL LETTER P
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
s | Ll | LATIN SMALL LETTER S
i | Ll | LATIN SMALL LETTER I
o | Ll | LATIN SMALL LETTER O
n | Ll | LATIN SMALL LETTER N
  | Zs | SPACE
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
  | Zs | SPACE
# | Po | NUMBER SIGN
t | Ll | LATIN SMALL LETTER T
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
s | Ll | LATIN SMALL LETTER S
'\n' | (repr)
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16

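As an alternative to try/except, unicodedata.name() accepts a second argument that is returned when a character has no name. A short sketch, where the sample string and the '<no name>' placeholder are made up for the example:

```python
import unicodedata

sample = 'a\n\u2764'

for char in sample:
    # The second argument is returned instead of raising ValueError.
    name = unicodedata.name(char, '<no name>')
    print(repr(char), '|', unicodedata.category(char), '|', name)

names = [unicodedata.name(char, '<no name>') for char in sample]
```
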
Because it has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c, I replace each surrogate with ?

import unicodedata

text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''

result = []

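for char in text:
    category = unicodedata.category(char)
    if category in ('So', 'Mn'):
        # symbols and non-spacing marks (e.g. VARIATION SELECTOR-16): use the name
        result.append(':{}:'.format(unicodedata.name(char)))
    elif category == 'Cs':
        # surrogate: it has no name, so use a placeholder
        result.append('?')
    else:
        result.append(char)

print(''.join(result))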

Result:

This was heart wrenching :HEAVY BLACK HEART::VARIATION SELECTOR-16:
Amazing compassion ?????? #tears
:HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16:
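
If the surrogate pairs are first merged with the encode/decode fix from the EDIT at the top of this answer, the same loop can name the purple hearts instead of printing ?. A sketch under that assumption; skipping VARIATION SELECTOR-16 is a choice made here for readability. Note that these are Unicode character names, not CLDR short names.

```python
import unicodedata

text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''

# Merge surrogate pairs into single code points first.
fixed = text.encode('utf-16', 'surrogatepass').decode('utf-16')

result = []
for char in fixed:
    if char == '\ufe0f':
        # VARIATION SELECTOR-16 only changes rendering, so skip it
        continue
    if unicodedata.category(char) == 'So':
        result.append(':{}:'.format(unicodedata.name(char)))
    else:
        result.append(char)

named = ''.join(result)
print(named)
```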

EDIT: Using Google again, I found the external module emoji, which can convert some emojis to names, but it also has a problem with \ud83d\udc9c. I used repr() to display the result, which also prints each newline as \n.

text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''

import emoji

#print( repr(emoji.demojize(text, use_aliases=True)) ) 
print( repr(emoji.demojize(text)) ) 

Result:

'This was heart wrenching :heart:\nAmazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears\n:heart::heart::heart:'

http://www.unicode.org/emoji/charts/full-emoji-list.html

https://www.webfx.com/tools/emoji-cheat-sheet/

http://unicode.org/Public/emoji/12.0/emoji-test.txt


BTW: I found the module demoji, which can find emojis and give their names. But it also has a problem with the code \ud83d\udc9c.

import demoji

# run only once after installing module
demoji.download_codes()

print(demoji.findall(text))

demoji.download_codes() has to run only once, after installing the module.

Result:

{'❤️': 'red heart'}

If you get it as JSON data "\ud83d\udc9c", then you shouldn't have a problem: json.loads() should convert it automatically.

import json

# escaped unicode in " "  
data = r'"\ud83d\udc9c"' 
print(json.loads(data))

In other situations you would have to convert it:

# convert to escaped unicode and put in " "  
data = '"{}"'.format('\ud83d\udc9c'.encode('unicode-escape').decode())
print(json.loads(data))
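
The manual quoting above would break if the text itself contained a double quote. A possible variant, sketched here with a hypothetical helper name, lets json.dumps() do the escaping and then loads the result back:

```python
import json


def fix_surrogates_via_json(text):
    """Round-trip through JSON to merge surrogate pairs (hypothetical helper)."""
    # json.dumps() (ensure_ascii=True by default) writes every non-ASCII
    # character as a \uXXXX escape and also escapes quotes and newlines;
    # json.loads() then joins each high/low surrogate escape pair.
    return json.loads(json.dumps(text))


converted = fix_surrogates_via_json('say "hi" \ud83d\udc9c')
print(converted == 'say "hi" \U0001f49c')  # True
```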

See also: How to work with surrogate pairs in Python?

furas