Unicode Characters in Twitter (Python)

Question

I've learned how to send tweets with Python, but I'm wondering if it's possible to send emojis or other special Unicode characters in the tweets.

For example, when I try to tweet u'1F430', it simply shows up as "1F430" in the tweet.

'1F430' is still a series of five alphanumeric characters whether you mark it as unicode or not. What character are you actually trying to send? — Daniel Roseman, Aug 12 '15 at 14:34
That was just an example, but that '1F430' should be a bunny emoji. How do I get a computer to read that as one character then? — codycrossley, Aug 12 '15 at 14:37
@mata, yes! How should I pass that into Python so that it reads it how I want it to? EDIT: Nevermind, your answer actually answers that. Thank you so much! — codycrossley, Aug 12 '15 at 14:37
@codycrossley do you use python2 or python3? there are a lot of differences regarding unicode handling between those versions, and there are different [possible escape sequences](https://docs.python.org/3/howto/unicode.html#unicode-literals-in-python-source-code), which can be used depending on the needed byte size for the unicode code point... — mata, Aug 12 '15 at 14:48
@mata, I generally use python2, but will eventually make the switch to python3. Thank you for the reference! — codycrossley, Aug 12 '15 at 15:09

score 2 · Accepted Answer · edited May 23 '17 at 11:51

>>> len(u'1f430')
5
>>> len(u'\U0001F430') 
1 # the latter might be equal to two in Python 2 on a narrow build (Windows, OS X)

The former is 5 characters, the latter is a single character.
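If the code point arrives as hexadecimal text (as in the question's "1F430"), it can be converted at run time instead of being written as an escape. The following is a minimal sketch assuming Python 3; the names `hex_code` and `rabbit` are only illustrative. In Python 2 the counterpart is `unichr`, which may reject code points above U+FFFF on a narrow build.

```python
# Assumes Python 3. hex_code and rabbit are illustrative names.
hex_code = "1F430"

# int(..., 16) parses the hexadecimal digits into a number;
# chr() maps that number to the character with that code point.
rabbit = chr(int(hex_code, 16))

print(len(hex_code))           # 5: plain text made of five characters
print(len(rabbit))             # 1: a single character
print(rabbit == "\U0001F430")  # True: same as the eight-digit escape
```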

If you want to specify the character in Python source code then you could use its name for readability:

>>> print(u"\N{RABBIT FACE}")

Note: it might not work in the Windows console. To display non-BMP Unicode characters there, you could use win-unicode-console + ConEmu.
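The name used in the `\N{...}` escape can also be resolved at run time with the standard `unicodedata` module, which is handy when the name comes from data rather than source code. A small sketch, assuming Python 3 (the variable name `rabbit` is illustrative):

```python
import unicodedata

# Look the character up by its Unicode name at run time.
rabbit = unicodedata.lookup("RABBIT FACE")

print(hex(ord(rabbit)))          # 0x1f430
print(unicodedata.name(rabbit))  # RABBIT FACE
```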

If you are reading it from a file, the network, etc., then this character is no different from any other: to decode bytes into Unicode text, you should specify a character encoding, e.g.:

import io

with io.open('filename', encoding='utf-8') as file:
    text = file.read()

Which specific encoding to use depends on the source; see, e.g., "A good way to get the charset/encoding of an HTTP response in Python".
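To illustrate why the encoding matters, here is a rough sketch (Python 3 assumed) that decodes the same four bytes twice. The byte string `raw` is the UTF-8 encoding of U+1F430, and the names `raw`, `text` and `mojibake` are made up for this example.

```python
# The UTF-8 encoding of U+1F430 is four bytes long.
raw = b"\xf0\x9f\x90\xb0"

# Decoding with the correct encoding yields one character.
text = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, but it
# produces four unrelated characters instead of one.
mojibake = raw.decode("latin-1")

print(len(text))      # 1
print(len(mojibake))  # 4
```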

Tom Dalton · Answer 2 · 2015-08-13T21:54:36.367

1

u'1F430' is the literal string "1F430". What character are you trying to get? In general, you can put a literal byte into a Python byte string with an escape such as "\x20", e.g.:

>>> print(b"#\x20#")
# #

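That is the byte with hexadecimal value 20 (decimal 32) between two hashes. In Python 2 the byte string is written to the terminal as-is, and byte 0x20 is an ASCII space. In Python 3 the same call prints the repr, b'# #', so prefer Unicode strings for text.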

>>> print(u"#\u0020#")
# #
>>> print(u"#\U0001F430#")
#🐰#

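The first prints Unicode code point U+0020 (a single space) between two hashes. The second prints U+1F430 (rabbit face), which lies outside the range of the four-digit \u escape and therefore needs the eight-digit \U form.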

See https://docs.python.org/3.3/howto/unicode.html for more info. NB: it can get a little confusing because Python 2 implicitly converts between bytes and unicode (using the ASCII encoding) in many cases, which can hide the issue from you for a while.
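By contrast, Python 3 refuses to mix the two types, which surfaces the problem immediately. A short sketch assuming Python 3 (the names `text`, `data` and `combined` are illustrative):

```python
text = "#\U0001F430#"        # Unicode text: 3 characters
data = text.encode("utf-8")  # bytes: 1 + 4 + 1 = 6 bytes

try:
    # Python 3 raises TypeError instead of converting implicitly.
    combined = data + text
except TypeError:
    # Encode explicitly so both operands are bytes.
    combined = data + text.encode("utf-8")

print(len(text))      # 3
print(len(data))      # 6
print(len(combined))  # 12
```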

edited Aug 13 '15 at 21:54

answered Aug 12 '15 at 14:37

Tom Dalton


For this code point a 4-digit escape sequence (`\uxxxx`) isn't enough; you need the 8-digit form (`\Uxxxxxxxx`). Also, if you use python2 syntax you shouldn't link to the documentation for python3, as that can be confusing for readers. – mata Aug 12 '15 at 14:53
don't print text as bytes. Which encoding is used to decode bytes depends on context. – jfs Aug 13 '15 at 18:56
