Consider:

$ cat bla.py 
u = unicode('d…')
s = u.encode('utf-8')
print s
$ python bla.py 
  File "bla.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file bla.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How can I declare UTF-8 strings in source code?

Peter Mortensen
Nullpoet

2 Answers

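In Python 3, UTF-8 is the default source encoding (see PEP 3120), so non-ASCII characters can be used directly in string literals and comments without any declaration, provided the file is actually saved as UTF-8.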

In Python 2, declare the encoding in a comment on the first or second line of the source file:

# -*- coding: utf-8 -*-
....

This is described in PEP 263.
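
To see which encoding a given header selects, one option is the standard library's `tokenize.detect_encoding`, which applies the PEP 263 rules. The following is a minimal sketch for Python 3; the helper name `declared_encoding` and the sample byte strings are made up for illustration. Note that it reports Python 3's fallback (UTF-8) when no declaration is present, whereas Python 2 falls back to ASCII.

```python
import codecs
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the canonical codec name Python 3 would use for this source."""
    # detect_encoding reads at most two lines, looking for a BOM or a
    # PEP 263 coding comment, and falls back to UTF-8 otherwise.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    # Normalize spellings such as "utf8" to the codec's canonical name.
    return codecs.lookup(encoding).name


# Hypothetical file contents, used only for this demonstration.
emacs_style = b"# -*- coding: utf-8 -*-\nx = 1\n"
after_shebang = b"#!/usr/bin/env python\n# coding: cp1250\nx = 1\n"
no_declaration = b"x = 1\n"

for sample in (emacs_style, after_shebang, no_declaration):
    print(declared_encoding(sample))
```

Both the Emacs-style form and the shorter `# coding: ...` form are recognized, and the declaration is still found on the second line when the first line is a shebang.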

Then you can use UTF-8 in strings:

# -*- coding: utf-8 -*-

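# Python 2 only: in Python 3, str has no .decode() method
u = 'idzie wąż wąską dróżką'  # byte string holding the UTF-8 bytes from the file
uu = u.decode('utf8')         # decode those bytes into a unicode object
s = uu.encode('cp1250')       # re-encode as cp1250 (Central European code page)
print(s)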
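
For comparison, here is a rough Python 3 counterpart of the snippet above. It is only a sketch and assumes the file is saved as UTF-8. In Python 3 the literal is already a Unicode string, so there is no decode step, and encoding to bytes is needed only when a specific byte encoding is required. The variable names are illustrative.

```python
# Python 3: no coding declaration is needed when the file is saved as UTF-8.
text = "idzie wąż wąską dróżką"        # str is already Unicode

utf8_bytes = text.encode("utf-8")      # non-ASCII letters take two bytes each here
cp1250_bytes = text.encode("cp1250")   # one byte per character in this code page

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert cp1250_bytes.decode("cp1250") == text

print(len(text), len(utf8_bytes), len(cp1250_bytes))
```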
Michał Niklas
  • now it gives """UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)""" – Nullpoet Jun 09 '11 at 07:36
  • You need not use `unicode()`; simply write the string in UTF-8 encoding. – Michał Niklas Jun 09 '11 at 08:03
  • In Python versions older than 3, you also need to prefix unicode string literals with "u": `some_string = u'idzie wąż wąską dróżką'`. – Anton Strogonoff Jun 09 '11 at 08:06
  • on a different string I am getting """UnicodeEncodeError: 'charmap' codec can't encode characters in position 1845-1846: character maps to <undefined>"""... does that mean a different encoding is required? – Nullpoet Jun 09 '11 at 08:20
  • or `#!/usr/bin/env python` on the first line, followed by `# coding: utf-8` on the second – warvariuc Jun 09 '11 at 08:47
  • The whole source file must be saved in UTF-8 encoding. Some text editors have a 'Save As...' dialog where you can also set the encoding. – Michał Niklas Jun 09 '11 at 08:48
  • Can you please be more specific about coding: utf8 header? I can't see any explanation in your provided PEP page. – Karolis Feb 06 '13 at 17:20
  • @Karolis, see the [Unicode Literals in Python Source Code](http://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code) section of the Python docs Unicode HOWTO – alldayremix May 11 '13 at 06:19
  • If you are reading from a file, you should consider using `codecs` as shown here: http://stackoverflow.com/questions/10376923/reading-non-ascii-characters-from-a-text-file – andilabs Nov 04 '13 at 12:57
  • Is it possible to have that as default system-wide? (when editing temporary short scripts for instance) – lajarre Mar 18 '14 at 09:58
  • No, declaring it is the only way I know. Many editors support code templates, so a new Python file can open with the header you like already in place. – Michał Niklas Mar 18 '14 at 10:17
  • use `# -*- coding: utf-8 -*-` or `# encoding: utf-8` ? oh, I think I found it: https://www.python.org/dev/peps/pep-0263/ – zx1986 Jul 21 '15 at 07:30
  • where is this source header file located? – Aslam Khan May 21 '16 at 08:22
  • I put it on the second line of the file. The first line holds the shebang. – Michał Niklas May 23 '16 at 07:04
  • @MichałNiklas I removed the `()` around `print` because it's a Python 2 code snippet and in Python 2 you don't use brackets for `print`. If you try to run that code on Python 3 you'll get this error: `AttributeError: 'str' object has no attribute 'decode'.`. – Boris Verkhovskiy Jan 02 '23 at 16:52
  • @BorisVerkhovskiy but Python 2 works very well with brackets in `print('wąż')`. You can see the discussion on it: https://stackoverflow.com/questions/13415181/brackets-around-print-in-python – Michał Niklas Jan 03 '23 at 08:30
  • that's pretty hacky – Boris Verkhovskiy Jan 03 '23 at 08:36

Do not forget to check that your text editor actually saves your code as UTF-8.

Otherwise, the file may contain invisible characters or bytes that are not valid UTF-8.
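
One way to check this without relying on the editor is to read the raw bytes and try to decode them as UTF-8. The following is a minimal Python 3 sketch; the helper name `is_valid_utf8` and the temporary file names are hypothetical. A file that passes could still have been written in another encoding that happens to produce valid UTF-8 bytes, so treat this as a sanity check rather than proof.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Demonstration with two throwaway files: one saved as UTF-8, one as cp1250.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.py")
    bad = os.path.join(tmp, "bad.py")
    with open(good, "wb") as f:
        f.write("s = 'wąż'\n".encode("utf-8"))
    with open(bad, "wb") as f:
        f.write("s = 'wąż'\n".encode("cp1250"))
    results = (is_valid_utf8(good), is_valid_utf8(bad))

print(results)
```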

Peter Mortensen
Ranaivo
  • Is this needed for python3? I know python3 assumes all literals within the code are unicode. But does it assume the source files are also written in utf8? – Ricardo Magalhães Cruz Jun 28 '16 at 23:58
  • @RicardoCruz Yes I believe utf-8 is the default for Python 3. See https://www.python.org/dev/peps/pep-3120/ – Jonathan Hartley Aug 10 '16 at 21:05
  • @ricardo-cruz *With Python 3, all strings will be Unicode strings, so the original encoding of the source will have no impact at run-time.* 1. [PEP 3120 -- Using UTF-8 as the default source encoding](https://www.python.org/dev/peps/pep-3120/) 2. [PEP 263 -- Defining Python Source Code Encodings](https://www.python.org/dev/peps/pep-0263/) – noobninja Jan 29 '17 at 01:45
  • @noobninja thanks for the links: PEP 3120 confirms that the source code itself is now assumed to be UTF-8, not just strings. – Ricardo Magalhães Cruz Jan 29 '17 at 10:38
  • Use `# coding: utf8` instead of `# -*- coding: utf-8 -*-`; it is far easier to remember. – show0k Apr 10 '17 at 13:35
  • @noobninja "the original encoding of the source will have no impact at runtime" is incorrect. Unicode string *constants* require Python to know the encoding of the source file to properly generate the Unicode strings. Python 3 assumes UTF-8 so `#coding` is required if the source is in another encoding and there are non-ASCII characters present. – Mark Tolonen Aug 16 '17 at 00:48