23

When writing a Python 3.1 CGI script, I run into horrible UnicodeDecodeErrors. However, when running the script on the command line, everything works.

It seems that open() and print() use the return value of locale.getpreferredencoding() to know what encoding to use by default. When running on the command line, that value is 'UTF-8', as it should be. But when running the script through a browser, the encoding mysteriously gets redefined to 'ANSI_X3.4-1968', which appears to be just a fancy name for plain ASCII.

I now need to know how to make the CGI script run with 'utf-8' as the default encoding in all cases. My setup is Python 3.1.3 and Apache2 on Debian Linux. The system-wide locale is en_GB.utf-8.
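
For reference, a minimal diagnostic script along these lines shows what the CGI process actually sees (a sketch; the output is only ASCII, so it cannot itself trigger the error):

    #!/usr/bin/env python3
    # Diagnostic sketch: report the encodings this process picks up.
    import locale
    import os
    import sys

    print('Content-Type: text/plain; charset=utf-8')
    print()
    print('locale.getpreferredencoding():', locale.getpreferredencoding())
    print('sys.stdout.encoding:', sys.stdout.encoding)
    print('LANG:', os.environ.get('LANG'))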

jforberg

7 Answers

17

Answering this for late-comers because I don't think that the posted answers get to the root of the problem, which is the lack of locale environment variables in a CGI context. I'm using Python 3.2.

  1. open() opens file objects in text (string) or binary (bytes) mode for reading and/or writing. In text mode the encoding used to encode strings written to the file, and to decode bytes read from it, may be specified in the call; if it isn't, it is determined by locale.getpreferredencoding(), which on Linux takes the encoding from your locale environment settings, normally UTF-8 (from e.g. LANG=en_US.UTF-8).

    >>> f = open('foo', 'w')         # open file for writing in text mode
    >>> f.encoding
    'UTF-8'                          # encoding is from the environment
    >>> f.write('€')                 # write a Unicode string
    1
    >>> f.close()
    >>> exit()
    user@host:~$ hd foo
    00000000  e2 82 ac      |...|    # data is UTF-8 encoded
    
  2. sys.stdout is in fact a file opened for writing in text mode, with an encoding based on locale.getpreferredencoding(). You can write strings to it just fine and they'll be encoded to bytes using sys.stdout's encoding. print() writes to sys.stdout by default; print() itself has no encoding, rather it's the file it writes to that has one.

    >>> sys.stdout.encoding
    'UTF-8'                          # encoding is from the environment
    >>> exit()
    user@host:~$ python3 -c 'print("€")' > foo
    user@host:~$ hd foo
    00000000  e2 82 ac 0a   |....|   # data is UTF-8 encoded; \n is from print()
    

    You cannot write bytes to sys.stdout - use sys.stdout.buffer.write() for that (a short sketch follows after this list). If you try to write bytes to sys.stdout using sys.stdout.write(), it will raise a TypeError; and if you use print(), print() will simply turn the bytes object into its string representation, so an escape sequence like \xff is treated as the four characters \, x, f, f:

    user@host:~$ python3 -c 'print(b"\xe2\x82\xac")' > foo
    user@host:~$ hd foo
    00000000  62 27 5c 78 65 32 5c 78  38 32 5c 78 61 63 27 0a  |b'\xe2\x82\xac'.|
    
  3. in a CGI script you need to write to sys.stdout, and you can use print() to do it; but a CGI script process under Apache has no locale environment settings - they are not part of the CGI specification - so the sys.stdout encoding defaults to ANSI_X3.4-1968, in other words ASCII. If you try to print() a string that contains non-ASCII characters to sys.stdout, you'll get "UnicodeEncodeError: 'ascii' codec can't encode character...: ordinal not in range(128)"

  4. a simple solution is to pass the Apache process's LANG environment variable through to the CGI script using the PassEnv directive from Apache's mod_env module in the server or virtual host configuration: PassEnv LANG. On Debian/Ubuntu, make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale", so that Apache runs with the system default locale rather than the C (POSIX) locale (which also uses ASCII). The following CGI script should then run without errors in Python 3.2:

    #!/usr/bin/env python3
    import sys
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('<html><body><pre>' + sys.stdout.encoding + '</pre>h€lló wörld</body></html>')
    

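A minimal sketch of the byte-level alternative mentioned in point 2 (illustrative only, not part of the original answer):

    #!/usr/bin/env python3
    # Sketch: write pre-encoded bytes directly, bypassing sys.stdout's encoding.
    import sys

    data = '€'.encode('utf-8')        # b'\xe2\x82\xac'
    sys.stdout.buffer.write(data)     # works whatever sys.stdout.encoding is
    # sys.stdout.write(data)          # would raise TypeError: str expected, not bytes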

Klesun
cercatrova
    Make sure line similar to `LANG="en_US.UTF-8"` is present in `/etc/default/locale` – Klesun Dec 03 '16 at 14:56
  • I also encountered this problem. My env is *Apache 2.4.6 / CentOS 7.4 / Python 3.6*. The system `LANG` variable is `en_US.UTF-8`. I set `PassEnv LANG` in httpd.conf, but it didn't work. Then I tried `SetEnv LANG en_US.UTF-8`, and that works: `locale.getpreferredencoding()` prints `utf-8`. I don't know why. – Jedore Dec 07 '18 at 01:13
  • See my comment on bobince's answer to see why I downvoted this - so that future readers will see how to solve it. The information in this answer is good, but it addresses the question from the wrong set of assumptions. The host computer should not be providing encoding for the python environment, rather the python environment should not be assuming the encoding for you. If you are dealing with encoded data, use the stream buffer directly and bypass the default decode/encode. – Bretton Wade Apr 20 '21 at 14:19
5

You shouldn't read your IO streams as strings for CGI/WSGI; they aren't Unicode strings, they're explicitly byte sequences.

(Consider that Content-Length is measured in bytes and not characters; imagine trying to read a multipart/form-data binary file upload submission crunched into UTF-8-decoded strings, or return a binary file download...)

So instead use sys.stdin.buffer and sys.stdout.buffer to get the raw byte streams for stdio, and read/write binary with them. It is up to the form-reading layer to convert those bytes into Unicode string parameters where appropriate using whichever encoding your web page has.

Unfortunately the standard library CGI and WSGI interfaces don't get this right in Python 3.1: the relevant modules were crudely converted from the Python 2 originals using 2to3 and consequently there are a number of bugs that will end up in UnicodeError.

The first version of Python 3 that is usable for web applications is 3.2. Using 3.0/3.1 is pretty much a waste of time. It took a lamentably long time to get this sorted out and PEP3333 passed.
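
For illustration, a byte-oriented CGI script along those lines might look like this (a sketch under this answer's assumptions, not bobince's own code):

    #!/usr/bin/env python3
    # Sketch: emit the whole CGI response as bytes via sys.stdout.buffer,
    # so the locale-derived encoding of sys.stdout never comes into play.
    import sys

    body = 'h€lló wörld'.encode('utf-8')                  # encode explicitly, once
    out = sys.stdout.buffer
    out.write(b'Content-Type: text/plain; charset=utf-8\r\n')
    out.write(b'Content-Length: ' + str(len(body)).encode('ascii') + b'\r\n')
    out.write(b'\r\n')
    out.write(body)
    out.flush()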

bobince
  • I agree. It seems like very bad behaviour for a package to force ASCII mode even though the default is now supposed to be Unicode for all text and files. Python 3.2 is not yet in Debian (stable) so I'm pretty much stuck with what 3.1 has to offer, for now. – jforberg Feb 18 '12 at 11:52
  • Revisiting nearly 10 years later, this is the correct answer. The Apache HTTPD doesn't do any encoding/decoding, it's strictly the python layers that are doing this. The data in/out have nothing to do with the host machine. The source data came from the client, and the result will be sent back to the client. – Bretton Wade Apr 20 '21 at 14:14
4

I solved my problem with the following code:

import locale
locale.getpreferredencoding = lambda: 'UTF-8'   # ensure subsequent open()s default to UTF-8

import sys
sys.stdin = open('/dev/stdin', 'r')             # re-open the standard streams
sys.stdout = open('/dev/stdout', 'w')           # in UTF-8 text mode
sys.stderr = open('/dev/stderr', 'w')

This solution is not pretty, but it seems to work for the time being. I actually chose Python 3 over the more commonplace v. 2.6 as my development platform because of its advertised good Unicode handling, but the cgi package seems to undo some of that simplicity.

I'm led to believe that the /dev/std* files may not exist on older systems that do not have a procfs. They are supported on recent Linuxes, however.
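
An alternative that avoids depending on /dev/std* would be to re-wrap the underlying byte streams instead (a sketch, untested on 3.1):

    import io
    import sys

    # Wrap the existing byte streams in an explicit UTF-8 text layer
    # instead of re-opening /dev/std*.
    sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')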

jforberg
  • I tried @cercatrova's answer above (edit `/etc/apache2/envvars`), but unfortunately that didn't work. @jforberg's solution worked, although I had to change `UTF-8` to `latin-1`. – ErikusMaximus Apr 20 '18 at 15:33
3

Summarizing @cercatrova's answer:

  • Add PassEnv LANG line to the end of your /etc/apache2/apache2.conf or .htaccess.
  • Uncomment . /etc/default/locale line in /etc/apache2/envvars.
  • Make sure a line similar to LANG="en_US.UTF-8" is present in /etc/default/locale.
  • sudo service apache2 restart
Klesun
2

Short answer: as detailed in "mod_cgi + utf8 + Python3 produces no output", just add this to your .htaccess:

SetEnv PYTHONIOENCODING utf8

along with:

Options +ExecCGI
AddHandler cgi-script .py
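
On Python 3.7 and later, a comparable effect can be had from inside the script itself, without touching the Apache configuration (a sketch, not from the original answer):

    #!/usr/bin/env python3
    # Sketch (Python 3.7+): force UTF-8 on the already-open stdout text stream.
    import sys

    sys.stdout.reconfigure(encoding='utf-8')
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('h€lló wörld')
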
Basj
1

Your best bet is to explicitly encode your Unicode strings into bytes using the encoding you want to use. Relying on the implicit conversion will lead to troubles like this.

BTW: If the error is really a UnicodeDecodeError, then it isn't happening on output; something is trying to decode a byte stream into Unicode, which would be happening somewhere else.
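
For example, being explicit at the file layer sidesteps the locale-dependent default (a sketch; the file names are placeholders):

    # Sketch: always pass the encoding explicitly instead of relying on the default.
    with open('input.txt', encoding='utf-8') as f:        # hypothetical input file
        text = f.read()

    with open('output.txt', 'w', encoding='utf-8') as f:  # hypothetical output file
        f.write(text)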

Ned Batchelder
  • Well, the script does both file input and output, so I get both decode and encode errors. Since the cgi package forces ASCII mode, my unicode encoded files won't read properly. – jforberg Feb 18 '12 at 11:50
0

I have encountered the same problem. My environment is Windows 10 + Apache 2.4 + Python 3.8.
I am developing an overlay for Google Earth Pro, which only accepts CGI for serving dynamic content.
The accepted answer explains the reason, but its method did not work for me.
My solution is:

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)  # wrap the byte stream in a UTF-8 writer

It works well.

סטנלי גרונן
Ryan Tu