Working with UTF-8 encoding in Python source

Question

Consider:

$ cat bla.py 
u = unicode('d…')
s = u.encode('utf-8')
print s
$ python bla.py 
  File "bla.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file bla.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How can I declare UTF-8 strings in source code?

"See http://www.python.org/peps/pep-0263.html for details" seems clear to me. — Lennart Regebro, May 04 '13 at 16:27

Michał Niklas · Answer 1 · 2023-01-03T07:54:37.737

871

In Python 3, UTF-8 is the default source encoding (see PEP 3120), so non-ASCII characters can appear in the source without any encoding declaration.
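A minimal Python 3 sketch of what "default source encoding" means in practice. It compiles source passed as raw bytes, so the result does not depend on how this snippet itself is saved. The helper name `load_s` and the sample string are made up for illustration.

```python
def load_s(source_bytes):
    """Compile source given as raw bytes and return the value bound to `s`."""
    namespace = {}
    exec(compile(source_bytes, '<example>', 'exec'), namespace)
    return namespace['s']


# UTF-8 bytes with no declaration: accepted, because UTF-8 is the default.
utf8_source = "s = 'wąż'\n".encode('utf-8')
print(load_s(utf8_source) == 'wąż')

# The same text saved as cp1250 with no declaration: rejected as invalid UTF-8.
cp1250_source = "s = 'wąż'\n".encode('cp1250')
try:
    load_s(cp1250_source)
except (SyntaxError, UnicodeDecodeError) as error:
    print('rejected:', type(error).__name__)

# cp1250 bytes preceded by a matching declaration: accepted again.
declared = b"# coding: cp1250\n" + cp1250_source
print(load_s(declared) == 'wąż')
```

So even in Python 3 a declaration is still needed when the file is saved in something other than UTF-8.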

In Python 2, you can declare the encoding in a comment on the first or second line of the source file:

# -*- coding: utf-8 -*-
....

This is described in PEP 0263.
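To check which declaration forms are recognized, the standard library's `tokenize.detect_encoding` can be pointed at a few sample headers. This is a Python 3 sketch; the helper name `declared_encoding` and the sample sources are illustrative only.

```python
import codecs
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the normalized codec name Python would use for this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    # Normalize spellings such as 'utf8' and 'utf-8' to one codec name.
    return codecs.lookup(encoding).name


emacs_style = b"# -*- coding: utf-8 -*-\nprint('ok')\n"
after_shebang = b"#!/usr/bin/env python\n# coding: utf8\nprint('ok')\n"
other_codec = b"# coding: cp1250\nprint('ok')\n"
too_late = b"\n\n# coding: cp1250\nprint('ok')\n"  # third line: ignored

print(declared_encoding(emacs_style))    # utf-8
print(declared_encoding(after_shebang))  # utf-8
print(declared_encoding(other_codec))    # cp1250
print(declared_encoding(too_late))       # utf-8 (the Python 3 default)
```

Note that under Python 3 a file without a declaration falls back to UTF-8, whereas Python 2 falls back to ASCII, which is what triggers the error in the question.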

Then you can use UTF-8 in strings:

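# -*- coding: utf-8 -*-

u = 'idzie wąż wąską dróżką'   # Python 2: a byte string holding UTF-8 bytes
uu = u.decode('utf8')          # decode the bytes into a unicode object
s = uu.encode('cp1250')        # re-encode as Windows-1250 bytes
print(s)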
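For comparison, a rough Python 3 equivalent of the snippet above, assuming the file is saved as UTF-8. In Python 3 the literal is already a Unicode `str`, so there is no `decode` step and `encode` returns `bytes`. The variable names are illustrative.

```python
# Python 3: no coding declaration needed for a UTF-8 source file.
text = 'idzie wąż wąską dróżką'    # already a Unicode str
encoded = text.encode('cp1250')    # bytes in the Windows-1250 code page

print(type(encoded).__name__)
print(encoded.decode('cp1250') == text)
```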

edited Jan 03 '23 at 07:54

answered Jun 09 '11 at 07:31

Michał Niklas

now it gives """UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)""" – Nullpoet Jun 09 '11 at 07:36
You need not use `unicode()`, simply write string in UTF-8 encoding. – Michał Niklas Jun 09 '11 at 08:03
In Python versions older than 3, you also need to prefix unicode string literals with "u": `some_string = u'idzie wąż wąską dróżką'`. – Anton Strogonoff Jun 09 '11 at 08:06
On a different string I am getting """UnicodeEncodeError: 'charmap' codec can't encode characters in position 1845-1846: character maps to <undefined>"""... does that mean a different encoding is required? – Nullpoet Jun 09 '11 at 08:20
or `#!/usr/bin/env python` on the first line and `# coding: utf-8` on the second – warvariuc Jun 09 '11 at 08:47
Whole source code must be saved in UTF-8 encoding. Some text editors have 'Save As...' where you can also set encoding. – Michał Niklas Jun 09 '11 at 08:48
Can you please be more specific about coding: utf8 header? I can't see any explanation in your provided PEP page. – Karolis Feb 06 '13 at 17:20
@Karolis, see the [Unicode Literals in Python Source Code](http://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code) section of the Python docs Unicode HOWTO – alldayremix May 11 '13 at 06:19
If you are reading from a file, you should consider using the `codecs` module as shown here: http://stackoverflow.com/questions/10376923/reading-non-ascii-characters-from-a-text-file – andilabs Nov 04 '13 at 12:57
Is it possible to have that as default system-wide? (when editing temporary short scripts for instance) – lajarre Mar 18 '14 at 09:58
No, declaring is the only way I know. Many editors can use code templates so if you open new Python it is opened with code you like. – Michał Niklas Mar 18 '14 at 10:17
use `# -*- coding: utf-8 -*-` or `# encoding: utf-8` ? oh, I think I found it: https://www.python.org/dev/peps/pep-0263/ – zx1986 Jul 21 '15 at 07:30
where is this source header file located? – Aslam Khan May 21 '16 at 08:22
I put it on the second line of the file; the first line holds the shebang. – Michał Niklas May 23 '16 at 07:04
@MichałNiklas I removed the `()` around `print` because it's a Python 2 code snippet and in Python 2 you don't use brackets for `print`. If you try to run that code on Python 3 you'll get this error: `AttributeError: 'str' object has no attribute 'decode'`. – Boris Verkhovskiy Jan 02 '23 at 16:52
@BorisVerkhovskiy but Python 2 works very well with brackets in `print('wąż')`. You can see the discussion about it: https://stackoverflow.com/questions/13415181/brackets-around-print-in-python – Michał Niklas Jan 03 '23 at 08:30
that's pretty hacky – Boris Verkhovskiy Jan 03 '23 at 08:36

score 92 · Answer 2 · edited Jul 08 '19 at 11:56

Do not forget to verify that your text editor actually saves your code as UTF-8.

Otherwise, you may have invisible characters that are not interpreted as UTF-8.
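One way to check this without relying on the editor is to read the file as raw bytes and try to decode them. The sketch below targets Python 3; `is_valid_utf8` and `write_temp` are hypothetical helper names, and the temporary files only stand in for a real source file.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, 'rb') as handle:
        data = handle.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def write_temp(data):
    """Write bytes to a temporary .py file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.py')
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    return path


# The same line saved by two differently configured "editors".
saved_as_utf8 = write_temp("s = 'wąż'\n".encode('utf-8'))
saved_as_cp1250 = write_temp("s = 'wąż'\n".encode('cp1250'))

print(is_valid_utf8(saved_as_utf8))
print(is_valid_utf8(saved_as_cp1250))

os.remove(saved_as_utf8)
os.remove(saved_as_cp1250)
```

A clean decode does not prove the file was meant to be UTF-8 (pure ASCII files always pass), but a failure is a sure sign the editor saved it in another encoding.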

edited Jul 08 '19 at 11:56

Peter Mortensen


answered Feb 18 '14 at 10:41

Ranaivo


Is this needed for python3? I know python3 assumes all literals within the code are unicode. But does it assume the source files are also written in utf8? – Ricardo Magalhães Cruz Jun 28 '16 at 23:58
@RicardoCruz Yes I believe utf-8 is the default for Python 3. See https://www.python.org/dev/peps/pep-3120/ – Jonathan Hartley Aug 10 '16 at 21:05
@ricardo-cruz *With Python 3, all strings will be Unicode strings, so the original encoding of the source will have no impact at run-time.* 1. [PEP 3120 -- Using UTF-8 as the default source encoding](https://www.python.org/dev/peps/pep-3120/) 2. [PEP 263 -- Defining Python Source Code Encodings](https://www.python.org/dev/peps/pep-0263/) – noobninja Jan 29 '17 at 01:45
@noobninja thanks for the links: PEP 3120 confirms that the source code itself is now assumed to be UTF-8, not just strings. – Ricardo Magalhães Cruz Jan 29 '17 at 10:38
Use `# coding: utf8`, which is far easier to remember, instead of `# -*- coding: utf-8 -*-`. – show0k Apr 10 '17 at 13:35
@noobninja "the original encoding of the source will have no impact at runtime" is incorrect. Unicode string *constants* require Python to know the encoding of the source file to properly generate the Unicode strings. Python 3 assumes UTF-8 so `#coding` is required if the source is in another encoding and there are non-ASCII characters present. – Mark Tolonen Aug 16 '17 at 00:48
