Python and UTF-8: kind of confusing

Question

I am on Google App Engine with Python 2.5. My application has to handle multiple languages, so I have to deal with UTF-8.

I have searched Google a lot but have not found what I want.

1. What is the purpose of # -*- coding: utf-8 -*- ?

2. What is the difference between

s=u'Witaj świecie'
s='Witaj świecie'

'Witaj świecie' is a utf-8 string.

3. When I save the .py file as UTF-8, do I still need the u before every string?

1. duplicate of: http://stackoverflow.com/questions/4872007/where-does-this-come-from-coding-utf-8 2. duplicate of: http://stackoverflow.com/questions/4172652/python-what-does-u-represent 3. you will find a thorough answer in Python's excellent documentation: http://docs.python.org/howto/unicode.html#unicode-literals-in-python-source-code — mechanical_meat, May 26 '12 at 07:41
explanations how to deal with unicode strings in python code http://stackoverflow.com/a/10650469/624829 — Zeugma, May 26 '12 at 11:23

score 6 · Accepted Answer · answered May 26 '12 at 07:41

u'blah' makes the literal a different kind of string (type unicode rather than type str): a sequence of Unicode code points. Without the u, it is a sequence of bytes. Only bytes can be written to disk or to a network stream, but you generally want to work in Unicode (although Python, and some libraries, will do some of the conversion for you). The encoding (UTF-8) is the translation between the two. So, yes, you should use the u in front of all your literals; it will make your life much easier. See Programatic Unicode for a better explanation.
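A minimal sketch of the text/bytes distinction described above. It is written to run on Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`; the variable names are illustrative only.

```python
# Runs on Python 3; on Python 2 read "str" as "unicode" and "bytes" as "str".

text = u'Witaj \u015bwiecie'     # Unicode text: a sequence of code points
data = text.encode('utf-8')      # UTF-8 bytes: what goes to disk or a socket

print(len(text))   # 13 code points
print(len(data))   # 14 bytes: U+015B takes two bytes in UTF-8

# Decoding with the same encoding recovers the original text.
round_trip = data.decode('utf-8')
assert round_trip == text
```

The string is written with a `\u015b` escape so the example does not depend on how the file itself is saved.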

The coding line tells Python what encoding your file is in, so that Python can understand it. Again, reading from disk gives bytes - but Python wants to see the characters. In Py2, the default encoding for code is ASCII, so the coding line lets you put things like ś directly in your .py file in the first place - other than that, it doesn't change how your code works.
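One way to see the effect of the coding line is to compile the same bytes under two different declarations. This is only a sketch, assuming Python 3 (where `compile()` accepts source as bytes); `run_source` is a made-up helper name, not part of any library.

```python
# The same bytes "on disk", compiled under two different coding declarations.
# Assumes Python 3, where compile() accepts the source as bytes.

# A one-line module; the two escaped bytes are the UTF-8 form of U+015B.
literal = b"s = u'\xc5\x9b'\n"


def run_source(coding_name):
    """Hypothetical helper: compile the snippet as if the file declared coding_name."""
    header = ('# -*- coding: %s -*-\n' % coding_name).encode('ascii')
    namespace = {}
    exec(compile(header + literal, '<demo>', 'exec'), namespace)
    return namespace['s']


as_utf8 = run_source('utf-8')      # the parser reads the two bytes as one character
as_latin1 = run_source('latin-1')  # the parser reads each byte as its own character

print(len(as_utf8), len(as_latin1))
```

The bytes are identical in both runs; only the declaration changes how the parser turns them into characters.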

lvc


Hi thanks. What is the difference between u'Witaj świecie' and unicode(u'Witaj świecie', 'utf-8') and unicode('Witaj świecie', 'utf-8') ? – Susan Mayer May 26 '12 at 13:09
@SusanMayer The first and the last will end up with the same result (in a utf8 encoded .py), but go about it differently: the first gets you the unicode string right from the parser, the last builds a byte string and then asks Python to build a unicode string from it using utf-8. This is broadly similar to the difference between `[1, 2]` and `list((1, 2))`. The middle one is an error: the first argument is already a unicode string, and you're telling Python to treat it as a utf8 encoded byte string - which doesn't make sense. – lvc May 26 '12 at 13:50
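The three spellings compared in this comment can be sketched roughly as follows. The code assumes Python 3, so `str(...)` stands in for Python 2's `unicode(...)` and a `b'...'` literal stands in for a Python 2 byte string; the variable names are illustrative.

```python
# Assumes Python 3: str(...) here corresponds to unicode(...) in Python 2.

from_literal = u'Witaj \u015bwiecie'                  # text straight from the parser
from_bytes = str(b'Witaj \xc5\x9bwiecie', 'utf-8')    # byte string, then decoded

# Both routes end at the same text.
assert from_literal == from_bytes

# Asking Python to decode something that is already text is an error.
try:
    str(from_literal, 'utf-8')
    decode_error = None
except TypeError as exc:
    decode_error = exc

print(decode_error)
```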
What's the difference between `unicode('Witaj świecie', 'utf-8')` and `unicode('Witaj świecie')`? If it is a `utf-8` .py file, is there no difference? If it is not a `utf-8` file, is there any difference? Thanks! – Susan Mayer May 26 '12 at 23:26
@SusanMayer the latter uses the default encoding, which isn't affected by the encoding of your source file. In Py2, it is usually ASCII, and so that breaks. The former explicitly mentions the encoding to try. The only time the encoding of the .py makes a difference to what encoding you want to tell `unicode()` (or, indeed, `.encode()` and `.decode()`, the preferred spellings these days) is in the case of these literals - and you're better off using the `u'...'` notation in these cases, so that your functionality doesn't depend on the source file encoding. – lvc May 27 '12 at 00:08
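A rough illustration of explicit versus default decoding. It assumes Python 3, where the default for `bytes.decode()` is UTF-8, so `'ascii'` is passed by hand to mimic the usual Python 2 default mentioned in the comment.

```python
# Assumes Python 3; 'ascii' is spelled out to imitate Python 2's usual default.

raw = b'Witaj \xc5\x9bwiecie'        # UTF-8 encoded bytes

explicit = raw.decode('utf-8')       # names the encoding, so it succeeds

try:
    raw.decode('ascii')              # stands in for Python 2's unicode(raw)
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True              # bytes above 0x7F are not valid ASCII

print(ascii_failed)
```

Neither call looks at the source file's encoding; only the bytes and the encoding named in the call matter.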
