
Some source files, from downloaded code, have the following header

# -*- coding: utf-8 -*-

I have an idea of what UTF-8 encoding is, but why would it be needed as a header in a Python source file?

Vadim Kotov
  • Usually you use it when Python complains about you having unicode characters in your source. – Blender Dec 10 '12 at 19:28
  • My comments contain unicode characters (portuguese), so I include this (different) header: `# coding: utf-8` – heltonbiker Dec 10 '12 at 19:30
  • if your strings look like `u"\u00b0C"` they do not need the header .. however strings like `"ØÆÅ "` would require the header ... – Joran Beasley Dec 10 '12 at 19:37
  • @heltonbiker: The reason to use the OP's form is that it informs Python and various text editors at the same time. Even if your editor doesn't understand this coding declaration, and doesn't need it because you've told it to default to utf-8 for Python code, someone else may read your code in a different editor… Unfortunately, there's no way to set something that triggers both vim-style and emacs-style editors at once, but since vim and emacs themselves can be configured to read each others' style, you can usually get away with just the emacs one. – abarnert Dec 10 '12 at 19:38
  • If you read http://www.python.org/dev/peps/pep-0263/ it explains the rationale in depth. – abarnert Dec 10 '12 at 19:39
  • @abarnert Thanks a lot, I didn't know about it, gonna take a read! (coming back already): I found this, which makes both forms equally valid, I think: "More precisely, the first or second line must match the regular expression `"coding[:=]\s*([-\w.]+)"` " – heltonbiker Dec 10 '12 at 19:40
  • @heltonbiker: The last comment was actually for the OP, not you, but I'm glad it helped someone. :) Anyway, the PEP doesn't really explain emacs-style coding declarations, it just refers to it as "formats recognized by popular editors", and later uses the term "Emacs style" way down in the examples. But hopefully you get the idea. – abarnert Dec 10 '12 at 19:46
  • @mgilson: You usually need it for literals, not variable names—especially since in 2.x, variable names can only use `[A-Za-z0-9_]` no matter what the coding declaration is (otherwise, there'd be no way for one module to refer to symbols from a module with a different charset). But yeah, 3.x's support for Unicode variable names is pretty cool for non-English-natives. – abarnert Dec 10 '12 at 19:54
  • @abarnert -- Thanks for the clarification. I've never really needed to use any non ASCII characters in my source (literals or otherwise) which explains my ignorance :). – mgilson Dec 10 '12 at 19:56
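
The regular expression quoted from PEP 263 in the comments above can be tried directly. This is only a minimal sketch in Python 3: the helper name `find_coding` and the sample header lines are made up for illustration, and the real interpreter applies extra rules (the declaration has to sit in a comment on the first or second line).

```python
import re

# Pattern as quoted from PEP 263 in the comments above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def find_coding(line):
    """Return the encoding name declared in a header line, or None (hypothetical helper)."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


emacs_style = find_coding("# -*- coding: utf-8 -*-")
short_style = find_coding("# coding: utf-8")
vim_style = find_coding("# vim: set fileencoding=latin-1 :")
no_declaration = find_coding("# just an ordinary comment")

print(emacs_style, short_style, vim_style, no_declaration)
```

Both the Emacs-style form from the question and the shorter `# coding: utf-8` form match, which is why the two are treated the same by Python.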

5 Answers


Whenever your code contains characters that are not ASCII, such as:

ă

the Python 2 interpreter will complain that it does not understand that character, unless the file declares an encoding.

This usually happens when you define constants.

Example: put this into x.py:

print 'ă'

then start a Python 2 console and import it:

>>> import x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "x.py", line 1
SyntaxError: Non-ASCII character '\xc4' in file x.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
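
For comparison, here is a rough Python 3 sketch of what the declaration changes. It compiles source given as raw bytes, so the interpreter has to decode them first. The byte string and the variable names are illustrative assumptions, not part of the example above. The byte 0xE3 is "ã" in Latin-1, and the declaration on the first line tells Python to decode it that way.

```python
# Source bytes as they might sit on disk: the literal holds the single byte 0xE3.
source = b"# -*- coding: latin-1 -*-\nchar = '\xe3'\n"

namespace = {}
# compile() accepts bytes and honours the coding declaration when decoding them.
exec(compile(source, "x.py", "exec"), namespace)

char = namespace["char"]
print(repr(char))
```

The declaration does not convert anything; it only tells the interpreter how to turn the bytes of the file into text.
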
mihaicc

A more direct answer:

In Python 3+: you don't need to declare any encoding.

UTF-8 is the default source encoding. Make sure the file is actually saved as UTF-8 without a BOM; some Windows editors do not do that by default. Declaring the encoding anyway won't hurt, and some editors may use the declaration.

In Python 2: always.

The default source encoding is ASCII, so any non-ASCII byte in the file is an error unless an encoding is declared. Put this comment on the first or second line of your files:

# -*- coding: utf-8 -*-

And remember: this is only about your source code files. Now, in the 3rd millennium, the plain "string" type does not exist anymore. What you handle is text, which is a sequence of bytes plus an encoding. You still have to specify the encoding in every input and output operation, because the defaults depend on your environment. So it is still better to follow the rule: explicit is better than implicit.
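
A small Python 3 sketch of what "explicit" looks like for input and output. The file name, the sample text, and the variable names are assumptions chosen for illustration; the non-ASCII characters are written as escapes so the snippet itself stays pure ASCII.

```python
import os
import tempfile

# "\u00d8\u00c6\u00c5 \u0103" is the text "ØÆÅ ă", spelled with escapes.
text = "\u00d8\u00c6\u00c5 \u0103"

# Text becomes bytes only once an encoding is chosen.
encoded = text.encode("utf-8")

# Name the encoding on every open() instead of relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

# Reading the same bytes with a different encoding silently gives other text.
with open(path, "rb") as handle:
    raw = handle.read()
misread = raw.decode("latin-1")

print(len(text), len(encoded), round_trip == text, misread == text)
```

Five characters turn into nine bytes here, and decoding those bytes as Latin-1 does not raise an error. It just produces the wrong text, which is why leaving the encoding implicit is risky.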

neves

Always use UTF-8 and make sure your editor also uses UTF-8. Start your Python script like this if you use Python 2.7:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

This is a good blog post from Nick Johnson about Python and UTF-8:

http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python By the way, this post was written before he could use:

from __future__ import unicode_literals
voscausa
  • Downvote for "Python 3 = UTF-8". This is at least confusing, if not plain wrong. – Dr. Jan-Philip Gehrcke Nov 27 '15 at 10:16
  • OK. Let's make things difficult: Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). More here: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ – voscausa Nov 29 '15 at 21:49
  • Yes, and unicode != UTF-8. Really, really. :-). But even "Python = unicode" would be plain confusing. What would that even mean? Let's build proper sentences to say what we actually want to express. And by the way: what do you want to say with "And never use the str() function. Only use str() if you really have to!!!"? Can you explain that? – Dr. Jan-Philip Gehrcke Nov 30 '15 at 14:36
  • Always use an encoding, like b'image/jpeg' or unicode_var.encode('utf-8'). I have updated my answer. – voscausa Dec 01 '15 at 01:03
  • Use `#!/usr/bin/env python`, not `#!/usr/bin/python`. – mattmc3 Oct 21 '16 at 19:05
  • OK, more here: http://stackoverflow.com/questions/2429511/why-do-people-write-usr-bin-env-python-on-the-first-line-of-a-python-script – voscausa Oct 22 '16 at 01:17

When you use non-ASCII characters. For instance, when I comment my source in Norwegian and the characters ØÆÅ occur in the .py file, Python will complain and not "compile" it.

arynaq

Whenever text is read or written, encodings come into play. Always. A Python interpreter has to read your file as text in order to understand it. The only situation where you can get away without dealing with encodings is when you only use characters in the ASCII range. In that case the interpreter can use virtually any encoding in the world and still get it right, because almost all encodings encode these characters to the same bytes.

You should not use coding: utf-8 just because you have characters beyond ASCII in your file; it can even be harmful. The declaration is a hint that tells the Python interpreter what encoding your file is in. Unless you have configured your text editor, it will most likely not save your files in UTF-8, and then the hint you gave to the Python interpreter is wrong.

So you should use it when your file is encoded in utf-8. If it's encoded in windows-1252, you should use coding: windows-1252 and so on.
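
To see which encoding Python 3 will actually pick for a file, the standard library's `tokenize.detect_encoding` can be used. The sketch below is illustrative only: the helper name `source_encoding` and the sample byte strings are assumptions, and the last part shows one way a wrong hint can fail.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode these source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared_utf8 = source_encoding(b"# -*- coding: utf-8 -*-\nname = 'x'\n")
declared_cp1252 = source_encoding(b"# coding: windows-1252\nname = 'x'\n")
undeclared = source_encoding(b"name = 'x'\n")  # Python 3 falls back to UTF-8

# A character saved by an editor using windows-1252, then decoded as UTF-8:
saved_as_cp1252 = "\u00d8".encode("cp1252")  # the single byte 0xD8
try:
    saved_as_cp1252.decode("utf-8")
    wrong_hint_fails = False
except UnicodeDecodeError:
    wrong_hint_fails = True

print(declared_utf8, declared_cp1252, undeclared, wrong_hint_fails)
```

The declaration only reports what the header claims. Whether the bytes on disk really are in that encoding depends on how the editor saved the file.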

Esailija
  • Exactly which text editor will most likely not save your files in utf-8? I can't find a single editor on my Mac or linux boxes that doesn't default to either UTF-8, or whatever the default locale charset is (which is also UTF-8 on any Mac and most linux installations). Are there still popular Windows editors that default to Latin-1 or windows-1252 or something? – abarnert Dec 10 '12 at 19:44
  • @abarnert I have tried dozens of editors on Windows and they never defaulted to utf-8. But yeah, I was implicitly referring to editors on Windows. Anyway, it doesn't do any harm if you always assume that an editor will not default to utf-8 – Esailija Dec 10 '12 at 19:46
  • IIRC (and I may not), Visual Studio is configurable per language, but the default is strict ASCII for C/C++/asm and UTF-8 for almost everything else. Beyond VS, I mostly used NTemacs, which obviously isn't a typical Windows editor. So, if there are lots of Windows programmers' editors that default to something silly, this is important advice. – abarnert Dec 10 '12 at 19:49
  • @abarnert Yeah, see this for example http://stackoverflow.com/q/13792061/995876. Anyway, many programs default to platform encoding, which is never utf-8 in windows and is (always?) utf-8 in linux. – Esailija Dec 10 '12 at 19:51
  • Well, it's not quite "always utf-8 in linux", so much as "almost always utf-8 in most recent linux distros", which is a fancy way of saying "so close to always that you won't discover your bugs until six months after you've forgotten the code and someone tries to run it on a weird system". OS X makes things a lot easier by absolutely requiring UTF-8 everywhere, no matter what (except within "legacy" classic-Mac or NeXT technologies). – abarnert Dec 10 '12 at 20:03