EDIT: I think I found a solution for the problem with surrogate codes like \ud83d\udc9c
text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
It converts the surrogate pair \ud83d\udc9c
to the correct emoji code point \U0001f49c
Source: How to work with surrogate pairs in Python?
Wikipedia: Surrogate
Other: Unicode character inspector
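As a quick check of the conversion above, here is a minimal sketch. The helper name fix_surrogates is my own, not part of any library. It assumes the surrogates in the text always come in valid high/low pairs; a lone surrogate would most likely make decode() raise UnicodeDecodeError.

```python
def fix_surrogates(text):
    # 'surrogatepass' lets the encoder write surrogate code units as raw UTF-16;
    # decoding that UTF-16 joins each high/low pair into one code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

broken = '\ud83d\udc9c'          # two separate surrogate code points
fixed = fix_surrogates(broken)   # one code point, U+1F49C

print(len(broken), len(fixed))   # 2 1
print(fixed == '\U0001f49c')     # True
```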
Using Google I found that the namereplace error handler converts a char to its Unicode name:
print('\U0001F44D'.encode('ascii', 'namereplace').decode())
Result:
\N{THUMBS UP SIGN}
And the unicodedata module gives the name directly:
import unicodedata
print(unicodedata.name('\U0001F44D'))
Result:
THUMBS UP SIGN
So it is good to use Google before you ask on Stack Overflow.
See also the Unicode HOWTO: https://docs.python.org/3/howto/unicode.html
The same for a longer text:
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
print(text.encode('ascii', 'namereplace').decode())
Result:
This was heart wrenching \N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
Now you may have to remove the \N{ and } around every name.
But it has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c: surrogates have no name, so they stay as escapes.
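One way to remove the \N{ and } wrappers is a regular expression. This is only a sketch: the :NAME: output format is my own choice (it matches the format used later in this answer), and the pattern assumes that a name never contains }.

```python
import re

# namereplace turns every non-ASCII char that has a name into \N{NAME}
encoded = 'ok \U0001F44D'.encode('ascii', 'namereplace').decode()
print(encoded)   # ok \N{THUMBS UP SIGN}

# replace the \N{...} wrapper with :...:
cleaned = re.sub(r'\\N\{([^}]+)\}', r':\1:', encoded)
print(cleaned)   # ok :THUMBS UP SIGN:
```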
You can use unicodedata in a for-loop to get the name of every char in the text, but unicodedata.name() raises ValueError if a char has no name, e.g. '\n'. It also gives names for normal chars, so you may have to use unicodedata.category() to decide which chars to replace.
This also has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c
import unicodedata
# http://www.unicode.org/reports/tr44/#General_Category_Values
for char in text:
    try:
        print(char, '|', unicodedata.category(char), '|', unicodedata.name(char))
    except ValueError:
        print(repr(char), '| (repr)')
Result:
T | Lu | LATIN CAPITAL LETTER T
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
s | Ll | LATIN SMALL LETTER S
| Zs | SPACE
w | Ll | LATIN SMALL LETTER W
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
| Zs | SPACE
h | Ll | LATIN SMALL LETTER H
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
t | Ll | LATIN SMALL LETTER T
| Zs | SPACE
w | Ll | LATIN SMALL LETTER W
r | Ll | LATIN SMALL LETTER R
e | Ll | LATIN SMALL LETTER E
n | Ll | LATIN SMALL LETTER N
c | Ll | LATIN SMALL LETTER C
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
| Zs | SPACE
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
'\n' | (repr)
A | Lu | LATIN CAPITAL LETTER A
m | Ll | LATIN SMALL LETTER M
a | Ll | LATIN SMALL LETTER A
z | Ll | LATIN SMALL LETTER Z
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
| Zs | SPACE
c | Ll | LATIN SMALL LETTER C
o | Ll | LATIN SMALL LETTER O
m | Ll | LATIN SMALL LETTER M
p | Ll | LATIN SMALL LETTER P
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
s | Ll | LATIN SMALL LETTER S
i | Ll | LATIN SMALL LETTER I
o | Ll | LATIN SMALL LETTER O
n | Ll | LATIN SMALL LETTER N
| Zs | SPACE
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
| Zs | SPACE
# | Po | NUMBER SIGN
t | Ll | LATIN SMALL LETTER T
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
s | Ll | LATIN SMALL LETTER S
'\n' | (repr)
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
Because it has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c
I replace every surrogate (category Cs) with ?
import unicodedata
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
result = []
for char in text:
    category = unicodedata.category(char)
    if category in ('So', 'Mn'):
        # symbols and variation selectors: use the name
        result.append(':{}:'.format(unicodedata.name(char)))
    elif category == 'Cs':
        # surrogates have no name
        result.append('?')
    else:
        result.append(char)
print(''.join(result))
Result:
This was heart wrenching :HEAVY BLACK HEART::VARIATION SELECTOR-16:
Amazing compassion ?????? #tears
:HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16:
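Instead of replacing the surrogates with ?, the loop can be combined with the surrogatepass conversion from the top of this answer, so the surrogate pairs become real emoji before their names are looked up. A minimal sketch: the function name emoji_to_names is mine, and it assumes all surrogates in the text form valid pairs.

```python
import unicodedata

text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''

def emoji_to_names(text):
    # first join surrogate pairs into single code points
    text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    result = []
    for char in text:
        if unicodedata.category(char) in ('So', 'Mn'):
            # the second argument is a default for chars without a name
            result.append(':{}:'.format(unicodedata.name(char, '?')))
        else:
            result.append(char)
    return ''.join(result)

print(emoji_to_names(text))
```

With this the second line should come out as "Amazing compassion :PURPLE HEART::PURPLE HEART::PURPLE HEART: #tears".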
EDIT: Using Google again I found the external module emoji, which can convert some emoji to names, but it also has a problem with \ud83d\udc9c, so I used repr to display the result - but this also prints every new line as \n
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
import emoji
#print( repr(emoji.demojize(text, use_aliases=True)) )
print( repr(emoji.demojize(text)) )
Result:
'This was heart wrenching :heart:\nAmazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears\n:heart::heart::heart:'
http://www.unicode.org/emoji/charts/full-emoji-list.html
https://www.webfx.com/tools/emoji-cheat-sheet/
http://unicode.org/Public/emoji/12.0/emoji-test.txt
BTW: I found the module demoji, which can find emoji and give their names. But it also has a problem with the code \ud83d\udc9c
import demoji
# run only once after installing module
demoji.download_codes()
print(demoji.findall(text))
It needs demoji.download_codes() only once, after installing the module.
Result:
{'❤️': 'red heart'}
If you get it as JSON data "\ud83d\udc9c" then you shouldn't have a problem - json.loads() should convert it automatically
import json
# escaped unicode in " "
data = r'"\ud83d\udc9c"'
print(json.loads(data))
In other situations you would have to convert it to an escaped string first
# convert to escaped unicode and put in " "
data = '"{}"'.format('\ud83d\udc9c'.encode('unicode-escape').decode())
print(json.loads(data))
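The same idea can probably be written as a round trip through json.dumps() and json.loads(), without building the escaped string by hand. This is only a sketch under the assumption that the default ensure_ascii=True is used; the function name fix_surrogates_via_json is mine.

```python
import json

def fix_surrogates_via_json(text):
    # json.dumps() escapes every non-ASCII char as \uXXXX (ensure_ascii=True is the default);
    # json.loads() then joins each escaped high/low pair into one code point.
    return json.loads(json.dumps(text))

fixed = fix_surrogates_via_json('compassion \ud83d\udc9c')
print(ascii(fixed))   # 'compassion \U0001f49c'
```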