Using Python 3 to minimize the pain of dealing with Unicode, I can print a UTF-8 character like this:

>>> print (u'\u1010')
တ

But when trying to do the same with UTF-16, let's say U+20000, u'\u20000' is the wrong way to initialize the character:

>>> print (u'\u20000')
    0
>>> print (list(u'\u20000'))
['\u2000', '0']

It is read as two characters instead: U+2000 followed by the digit 0.

I've also tried the capital U escape, i.e. u'\U20000', but it raises an escape error:

>>> print (u'\U20000')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape

A capital U prefix outside the string didn't work either:

>>> print (U'\u20000')
 0
>>> print (U'\U20000')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
alvas

2 Answers

These are not UTF-8 and UTF-16 literals, just Unicode string literals, and both forms mean the same thing:

>>> print(u'\u1010')
တ
>>> print(u'\U00001010')
တ
>>> print(u'\u1010' == u'\U00001010')
True

The second form just allows you to specify a code point above U+FFFF.
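As a small sketch of the difference (the variable names are only for illustration): `\u` always consumes exactly four hex digits and `\U` exactly eight, and `chr()` gives you the same character from the integer code point.

```python
# \u takes exactly 4 hex digits, \U takes exactly 8.
bmp = '\u1010'           # U+1010 fits in the four-digit form
astral = '\U00020000'    # U+20000 needs the eight-digit form

# chr() builds the same character from the integer code point.
same = chr(0x20000)

print(len(astral))       # a single character
print(hex(ord(astral)))  # its code point
print(astral == same)

# With \u only four digits are consumed, so the trailing 0 is a separate character.
split = list('\u20000')
print(split)
```

This is also why `'\U20000'` is rejected: the escape is cut off before it reaches eight digits.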

How to do this the easiest way: encode your source file as UTF-8 (or UTF-16), and then you can just write u"တ" and u"𠀀".

UTF-8 and UTF-16 are ways to encode those code points as bytes. To be technical, in UTF-8 U+20000 would be b"\xf0\xa0\x80\x80" (which I would probably write as u"𠀀".encode("utf-8")).
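To make the distinction concrete, here is a minimal sketch that encodes the same one-character string both ways. I'm using `utf-16-be` so that no byte order mark is prepended; the variable names are just illustrative.

```python
ch = '\U00020000'                # one Unicode character, U+20000

utf8 = ch.encode('utf-8')        # four bytes in UTF-8
utf16 = ch.encode('utf-16-be')   # a surrogate pair in UTF-16 (big-endian, no BOM)

print(utf8.hex())                # f0a08080
print(utf16.hex())               # d840dc00

# Decoding either byte string gives back the same single character.
print(utf8.decode('utf-8') == ch)
print(utf16.decode('utf-16-be') == ch)
```

The string itself is the same in both cases; only the byte representation differs.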

roeland
  • My OP didn't have the word literal... Someone edited it... =) – alvas Jan 11 '17 at 05:26
  • @alvas you still have some confusion about the difference between the terms UTF-8, UTF-16, and Unicode. Until you understand those differences you will continue to have trouble. `'\U00020000'` isn't UTF-8 *or* UTF-16, it's a single Unicode character. – Mark Ransom Jan 11 '17 at 17:15
  • I think I understand them, just not the syntax to initialize them in Python =) http://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16, right? – alvas Jan 12 '17 at 00:35
  • @alvas if you've read and understand the question and all the answers, you're in better shape than most people. Now the problem is to be more precise in your usage of the terms. – Mark Ransom Jan 12 '17 at 18:00

As @Mark Ransom's comment shows, Python's \U escape is not a UTF-16 notation; it names a Unicode code point and requires exactly eight hexadecimal digits to work.

Therefore, the Python code to use is:

u"\U00020000"

as listed on this page:

Python source code u"\U00020000"

Right leg