I am using Python 2.7.3. Can anybody explain the difference between the literals:

'\u0391'

and:

u'\u0391'

and why they are echoed differently in the REPL below (especially the extra backslash added to a1):

>>> a1='\u0391'
>>> a1
'\\u0391'
>>> type(a1)
<type 'str'>
>>> 
>>> a2=u'\u0391'
>>> a2
u'\u0391'
>>> type(a2)
<type 'unicode'>
>>> 
hippietrail
Marcus Junius Brutus
    It's worth noting that in Python 3, these are identical, and both of type `str`, because `str` is now Unicode (but `b'\u0391'` is still equivalent to your `a1`, except it's of type `bytes`). – abarnert Jan 28 '13 at 10:02

2 Answers


Unicode escapes (\uabcd) are only interpreted in a Unicode string literal. In a byte string they have no meaning, so the backslash and the characters after it are kept as they are. A Python 2 Unicode literal (u'some text') produces a different type of Python object from a Python 2 byte string literal ('some text').

It's like using \t versus \T: the former has meaning in Python string literals (it's interpreted as a tab character), while the latter is not a recognized escape and just means a backslash and a capital T (two characters).
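The comparison above can be checked with a minimal sketch. It parses the literals from source text with `ast.literal_eval`, so the same code should run on Python 2.7 and Python 3; `parse_literal` is a hypothetical helper introduced only for this illustration, and the `b` prefix is used so the byte string is a byte string on both versions.

```python
import ast
import warnings


def parse_literal(source):
    """Evaluate Python literal source text (hypothetical helper for this sketch)."""
    with warnings.catch_warnings():
        # Newer Python 3 releases warn about unrecognized escapes such as \T.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


tab = parse_literal(r"'\t'")               # recognized escape: one tab character
not_escape = parse_literal(r"'\T'")        # unrecognized: backslash followed by T
byte_string = parse_literal(r"b'\u0391'")  # \u is not an escape in a byte string
text_string = parse_literal(r"u'\u0391'")  # \u is an escape in a Unicode string

print(len(tab))          # 1
print(len(not_escape))   # 2
print(len(byte_string))  # 6
print(len(text_string))  # 1
```

The lengths show that only the recognized escapes collapse into a single character.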

To help understand the difference between Unicode and byte strings, please do read the Python Unicode HOWTO; I can also recommend Joel Spolsky's article on Unicode.

Note: in Python 3, the same differences apply, but 'some text' is a Unicode string literal, and b'some text' is the bytestring syntax.
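As a rough illustration of the Python 3 behaviour, the sketch below assumes a Python 3 interpreter (on Python 2 the unprefixed literal would be a six-character byte string instead). The variable names are illustrative only, and UTF-8 is just one possible choice of encoding.

```python
# Assumes Python 3, where an unprefixed literal is a Unicode string.
text = '\u0391'                    # one character: GREEK CAPITAL LETTER ALPHA
encoded = text.encode('utf-8')     # explicit conversion to a bytes object
decoded = encoded.decode('utf-8')  # and back to a Unicode string

print(type(text).__name__)     # str
print(type(encoded).__name__)  # bytes
print(encoded)                 # b'\xce\x91'
print(text == u'\u0391')       # True
print(decoded == text)         # True
```

Moving between the two types always goes through an explicit encode or decode step with a named encoding.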

Martijn Pieters
  • +1. If you want a `str` with `\u0391` in it, you need to pick an encoding, and write, e.g., `u'\u0391'.encode('utf-8')`, which will give you `'\xce\x91'`. – abarnert Jan 28 '13 at 10:00
  • If u'some text' is different than 'some text' how do you explain: u'a'=='a' which evaluates to True ? – Marcus Junius Brutus Jan 28 '13 at 10:04
  • @MarcusJuniusBrutus: Python auto-decodes to Unicode when comparing the two. You can compare floats to integers too, doesn't make them the same type though. Python decodes the byte string to attempt the test; try `u'\u0391' == u'\u0391'.encode('utf8')` and you'll get a warning (decoding is done from ASCII by default). – Martijn Pieters Jan 28 '13 at 10:13
  • @MarcusJuniusBrutus: The same way `1 == 1.0` is True. Equality doesn't necessarily mean identity. – abarnert Jan 28 '13 at 10:15
  • @MartijnPieters: Yeah, I deleted my comment after you edited yours, and before you responded. (But are you sure it's from ASCII rather than `sys.getdefaultencoding()`? Of course that's usually ASCII in 2.7 anyway…) – abarnert Jan 28 '13 at 10:20
  • @abarnert: exactly; I didn't want to complicate things more, let alone give people the idea that setting the default encoding is a good thing to do. ASCII is the default encoding on Python 2, and yes, `sys.getdefaultencoding()` is consulted when auto-converting unicode to bytes or vice versa. – Martijn Pieters Jan 28 '13 at 10:24

Unlike C, Python lets you enclose a string in single quotes (') as well as double quotes ("), leaving aside triple-quoted strings such as """.

Thus, in Python 2, '\u0391' is just a byte string containing the six characters \, u, 0, 3, 9 and 1. When the REPL echoes this string, it shows its repr, in which the \ is escaped with another \.

By contrast, the u prefix makes the literal a Unicode string, in which \u escapes are evaluated. Thus, u'\u0391' is interpreted as "the Unicode string containing code point U+0391", which is different from the above.
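A small sketch of the two cases follows. To behave the same on Python 2.7 and Python 3, it spells the six-character string as a raw literal (r'\u0391'), which is an assumption of this example and not what the question typed; the variable names are illustrative.

```python
import unicodedata

six_chars = r'\u0391'  # raw literal: the same six characters that a1 holds
one_char = u'\u0391'   # Unicode literal: the \u escape is evaluated

print(list(six_chars) == ['\\', 'u', '0', '3', '9', '1'])  # True
print(repr(six_chars))             # '\\u0391' (repr doubles the backslash)
print(six_chars)                   # \u0391 (the string holds one backslash)
print(len(one_char))               # 1
print(ord(one_char) == 0x0391)     # True
print(unicodedata.name(one_char))  # GREEK CAPITAL LETTER ALPHA
```

The doubled backslash appears only in the repr that the REPL echoes; the string itself still contains a single backslash.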

Mihai Maruseac