23

When writing a Python 3.1 CGI script, I run into horrible UnicodeDecodeErrors. However, when running the script on the command line, everything works.

It seems that open() and print() use the return value of locale.getpreferredencoding() to know what encoding to use by default. When running on the command line, that value is 'UTF-8', as it should be. But when running the script through a browser, the encoding mysteriously gets redefined to 'ANSI_X3.4-1968', which appears to be just a fancy name for plain ASCII.

I now need to know how to make the CGI script run with 'utf-8' as the default encoding in all cases. My setup is Python 3.1.3 and Apache2 on Debian Linux. The system-wide locale is en_GB.utf-8.
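
For reference, a minimal diagnostic script along these lines shows what the CGI process actually sees (a sketch; the output is only ASCII, so it cannot itself trigger the error):

    #!/usr/bin/env python3
    # Diagnostic sketch: report the encodings this process picks up.
    import locale
    import os
    import sys

    print('Content-Type: text/plain; charset=utf-8')
    print()
    print('locale.getpreferredencoding():', locale.getpreferredencoding())
    print('sys.stdout.encoding:', sys.stdout.encoding)
    print('LANG:', os.environ.get('LANG'))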

jforberg

7 Answers

17

Answering this for late-comers because I don't think that the posted answers get to the root of the problem, which is the lack of locale environment variables in a CGI context. I'm using Python 3.2.

  1. open() opens file objects in text (string) or binary (bytes) mode for reading and/or writing. In text mode the encoding used to encode strings written to the file, and to decode bytes read from it, may be specified in the call; if it isn't, it is determined by locale.getpreferredencoding(), which on Linux takes the encoding from your locale environment settings, normally UTF-8 (from e.g. LANG=en_US.UTF-8).

    >>> f = open('foo', 'w')         # open file for writing in text mode
    >>> f.encoding
    'UTF-8'                          # encoding is from the environment
    >>> f.write('€')                 # write a Unicode string
    1
    >>> f.close()
    >>> exit()
    user@host:~$ hd foo
    00000000  e2 82 ac      |...|    # data is UTF-8 encoded
    
  2. sys.stdout is in fact a file opened for writing in text mode, with an encoding based on locale.getpreferredencoding(). You can write strings to it just fine and they'll be encoded to bytes using sys.stdout's encoding. print() writes to sys.stdout by default; print() itself has no encoding, rather it's the file it writes to that has one.

    >>> sys.stdout.encoding
    'UTF-8'                          # encoding is from the environment
    >>> exit()
    user@host:~$ python3 -c 'print("€")' > foo
    user@host:~$ hd foo
    00000000  e2 82 ac 0a   |....|   # data is UTF-8 encoded; \n is from print()
    

    You cannot write bytes to sys.stdout - use sys.stdout.buffer.write() for that (a short sketch follows after this list). If you try to write bytes to sys.stdout using sys.stdout.write(), it will raise a TypeError; and if you use print(), print() will simply turn the bytes object into its string representation, so an escape sequence like \xff is treated as the four characters \, x, f, f:

    user@host:~$ python3 -c 'print(b"\xe2\x82\xac")' > foo
    user@host:~$ hd foo
    00000000  62 27 5c 78 65 32 5c 78  38 32 5c 78 61 63 27 0a  |b'\xe2\x82\xac'.|
    
  3. in a CGI script you need to write to sys.stdout, and you can use print() to do it; but a CGI script process under Apache has no locale environment settings - they are not part of the CGI specification - so the sys.stdout encoding defaults to ANSI_X3.4-1968, in other words ASCII. If you try to print() a string that contains non-ASCII characters to sys.stdout, you'll get "UnicodeEncodeError: 'ascii' codec can't encode character...: ordinal not in range(128)"

  4. a simple solution is to pass the Apache process's LANG environment variable through to the CGI script using the PassEnv directive from Apache's mod_env module in the server or virtual host configuration: PassEnv LANG. On Debian/Ubuntu, make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale", so that Apache runs with the system default locale rather than the C (POSIX) locale (which also uses ASCII). The following CGI script should then run without errors in Python 3.2:

    #!/usr/bin/env python3
    import sys
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('<html><body><pre>' + sys.stdout.encoding + '</pre>h€lló wörld</body></html>')
    

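A minimal sketch of the byte-level alternative mentioned in point 2 (illustrative only, not part of the original answer):

    #!/usr/bin/env python3
    # Sketch: write pre-encoded bytes directly, bypassing sys.stdout's encoding.
    import sys

    data = '€'.encode('utf-8')        # b'\xe2\x82\xac'
    sys.stdout.buffer.write(data)     # works whatever sys.stdout.encoding is
    # sys.stdout.write(data)          # would raise TypeError: str expected, not bytes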

Klesun
cercatrova
    Make sure line similar to `LANG="en_US.UTF-8"` is present in `/etc/default/locale` – Klesun Dec 03 '16 at 14:56
  • I also encountered this problem. My env is *Apache 2.4.6 / CentOS 7.4 / Python 3.6*. The system `LANG` variable is `en_US.UTF-8`. I set `PassEnv LANG` in httpd.conf, but it didn't work. Then I tried `SetEnv LANG en_US.UTF-8`, and that works: `locale.getpreferredencoding()` prints `utf-8`. I don't know why. – Jedore Dec 07 '18 at 01:13
  • See my comment on bobince's answer to see why I downvoted this - so that future readers will see how to solve it. The information in this answer is good, but it addresses the question from the wrong set of assumptions. The host computer should not be providing encoding for the python environment, rather the python environment should not be assuming the encoding for you. If you are dealing with encoded data, use the stream buffer directly and bypass the default decode/encode. – Bretton Wade Apr 20 '21 at 14:19
5

You shouldn't read your IO streams as strings for CGI/WSGI; they aren't Unicode strings, they're explicitly byte sequences.

(Consider that Content-Length is measured in bytes and not characters; imagine trying to read a multipart/form-data binary file upload submission crunched into UTF-8-decoded strings, or return a binary file download...)

So instead use sys.stdin.buffer and sys.stdout.buffer to get the raw byte streams for stdio, and read/write binary with them. It is up to the form-reading layer to convert those bytes into Unicode string parameters where appropriate using whichever encoding your web page has.

Unfortunately the standard library CGI and WSGI interfaces don't get this right in Python 3.1: the relevant modules were crudely converted from the Python 2 originals using 2to3 and consequently there are a number of bugs that will end up in UnicodeError.

The first version of Python 3 that is usable for web applications is 3.2. Using 3.0/3.1 is pretty much a waste of time. It took a lamentably long time to get this sorted out and PEP3333 passed.
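
For illustration, a byte-oriented CGI script along those lines might look like this (a sketch under this answer's assumptions, not bobince's own code):

    #!/usr/bin/env python3
    # Sketch: emit the whole CGI response as bytes via sys.stdout.buffer,
    # so the locale-derived encoding of sys.stdout never comes into play.
    import sys

    body = 'h€lló wörld'.encode('utf-8')                  # encode explicitly, once
    out = sys.stdout.buffer
    out.write(b'Content-Type: text/plain; charset=utf-8\r\n')
    out.write(b'Content-Length: ' + str(len(body)).encode('ascii') + b'\r\n')
    out.write(b'\r\n')
    out.write(body)
    out.flush()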

bobince
  • I agree. It seems like very bad behaviour for a package to force ASCII mode even though the default is now supposed to be Unicode for all text and files. Python 3.2 is not yet in Debian (stable) so I'm pretty much stuck with what 3.1 has to offer, for now. – jforberg Feb 18 '12 at 11:52
  • Revisiting nearly 10 years later, this is the correct answer. The Apache HTTPD doesn't do any encoding/decoding, it's strictly the python layers that are doing this. The data in/out have nothing to do with the host machine. The source data came from the client, and the result will be sent back to the client. – Bretton Wade Apr 20 '21 at 14:14
4

I solved my problem with the following code:

import locale
locale.getpreferredencoding = lambda: 'UTF-8'   # ensure subsequent open()s default to UTF-8

import sys
sys.stdin = open('/dev/stdin', 'r')             # re-open the standard streams
sys.stdout = open('/dev/stdout', 'w')           # in UTF-8 text mode
sys.stderr = open('/dev/stderr', 'w')

This solution is not pretty, but it seems to work for the time being. I actually chose Python 3 over the more commonplace v. 2.6 as my development platform because of its advertised good Unicode handling, but the cgi package seems to undo some of that simplicity.

I'm led to believe that the /dev/std* files may not exist on older systems that do not have a procfs. They are supported on recent Linuxes, however.
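
An alternative that avoids depending on /dev/std* would be to re-wrap the underlying byte streams instead (a sketch, untested on 3.1):

    import io
    import sys

    # Wrap the existing byte streams in an explicit UTF-8 text layer
    # instead of re-opening /dev/std*.
    sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')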

jforberg
  • I tried @cercatrova's answer above (edit `/etc/apache2/envvars`), but unfortunately that didn't work. @jforberg's solution worked, although I had to change `UTF-8` to `latin-1`. – ErikusMaximus Apr 20 '18 at 15:33
3

Summarizing @cercatrova's answer:

  • Add PassEnv LANG line to the end of your /etc/apache2/apache2.conf or .htaccess.
  • Uncomment . /etc/default/locale line in /etc/apache2/envvars.
  • Make sure a line similar to LANG="en_US.UTF-8" is present in /etc/default/locale.
  • sudo service apache2 restart
Klesun
2

Short answer: as detailed in "mod_cgi + utf8 + Python3 produces no output", just add this to your .htaccess:

SetEnv PYTHONIOENCODING utf8

along with:

Options +ExecCGI
AddHandler cgi-script .py
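
On Python 3.7 and later, a comparable effect can be had from inside the script itself, without touching the Apache configuration (a sketch, not from the original answer):

    #!/usr/bin/env python3
    # Sketch (Python 3.7+): force UTF-8 on the already-open stdout text stream.
    import sys

    sys.stdout.reconfigure(encoding='utf-8')
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('h€lló wörld')
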
Basj
1

Your best bet is to explicitly encode your Unicode strings into bytes using the encoding you want to use. Relying on the implicit conversion will lead to troubles like this.

BTW: If the error is really a UnicodeDecodeError, then it isn't happening on output; something is trying to decode a byte stream into Unicode, which would be happening somewhere else.
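
For example, being explicit at the file layer sidesteps the locale-dependent default (a sketch; the file names are placeholders):

    # Sketch: always pass the encoding explicitly instead of relying on the default.
    with open('input.txt', encoding='utf-8') as f:        # hypothetical input file
        text = f.read()

    with open('output.txt', 'w', encoding='utf-8') as f:  # hypothetical output file
        f.write(text)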

Ned Batchelder
  • Well, the script does both file input and output, so I get both decode and encode errors. Since the cgi package forces ASCII mode, my unicode encoded files won't read properly. – jforberg Feb 18 '12 at 11:50
0

I have encountered the same problem. My environment is Windows 10 + Apache 2.4 + Python 3.8.
I am developing an overlay for Google Earth Pro, which only accepts CGI for serving dynamic content.
The accepted answer explains the reason, but its method did not work for me.
My solution is:

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)  # wrap the byte stream in a UTF-8 writer

It works well.

סטנלי גרונן
Ryan Tu