Python script reads different Unicode strings when run from the shell or via CGI

Question

On my Ubuntu server I have a directory that contains these two files:

testDir# ls -als
insgesamt 12
4 drwxr-xr-x 2 root root 4096 Mai 29 15:12 .
4 drwxr-xr-x 6 root root 4096 Mai 28 18:38 ..
0 -rw-r--r-- 1 root root    0 Mai 28 19:17 Ö.txt
4 -rw-r--r-- 1 root root    9 Mai 28 19:16 Ö.txt

The file names look the same, but they are not. The file with size 0 has one character before the dot (Unicode code point 214 = Ö); the other file (size 9) has two characters (code point 79 = O followed by 776 = ¨, a combining character that modifies the character before it). To display the Unicode code points, I wrote this little script:

#!/usr/bin/env python3

import os

def printFileList(fileList):
    for file in fileList:
        string = ""
        for char in file:
            string += str(ord(char)) + " "
        string += "<br>"
        print(string)

print("Content-Type: text/html\n")

printFileList(os.listdir("testDir"))

printFileList(["Ö.txt", "Ö.txt"])

As you can see, I first read the file names from the operating system and display the code points of their characters. Then I do the same with strings that are hard-coded in the program.

When I run this program from the shell, I get this result:

testDir# ./test.py
Content-Type: text/html

79 776 46 116 120 116 <br>
214 46 116 120 116 <br>
79 776 46 116 120 116 <br>
214 46 116 120 116 <br>

But this script (more precisely, a more advanced version of it) is meant to be run as a CGI script from a web server. My web server is Apache 2, and when I call this script from a browser, I get this result:

79 56524 56456 46 116 120 116 
56515 56470 46 116 120 116 
79 776 46 116 120 116 
214 46 116 120 116

The string Content-Type: text/html is part of the HTTP response header and is not displayed, and <br> is rendered as a line break, so these parts aren't visible in a browser for good reasons. But look at the numbers!

What should be 776 is 56524 56456 in the first line, and in the second line 214 became 56515 56470. But this happened only for the file names read from the operating system. The hard-coded strings are correct.

My questions:

1) What causes this strange behavior?
2) What has to be changed, so that the correct code points (776 and 214) are shown?

addendum

I added these lines to my program:

import sys

print(sys.getfilesystemencoding())

The output of this line, when run from the shell:
```
utf-8
```
which is correct.

When run from Apache as a CGI script:
```
ascii
```
which is wrong.

So, my new question is:

How can I tell my script that it should always use utf-8 as the file system encoding?

Found something which might help you . https://stackoverflow.com/questions/9322410/set-encoding-in-python-3-cgi-scripts — apoorva kamath, May 29 '20 at 17:25
@apoorvakamath, no it looks like it's really the file-system encoding, not the IO encoding. — lenz, May 29 '20 at 18:14

Hubert Schölnast · Answer 1 · 2020-05-30T08:18:43.930

I am answering my own question.

I still don't have an answer to my first question ("What causes this strange behavior?"), so that one is still open, and I am really curious about it.

But I have found a workaround to get correct results without really solving the original problem.

Here is a version of my test program that produces the same correct output when run from the shell and when run from Apache as a CGI script:

#!/usr/bin/env python3

import os

def printFileList(fileList):
    for file in fileList:
        file = file.decode("utf-8")
        string = ""
        for char in file:
            string += str(ord(char)) + " "
        string += "<br>"
        print(string)

print("Content-Type: text/html\n")

printFileList(os.listdir("testDir".encode("utf-8")))

printFileList(["Ö.txt".encode("utf-8"), "Ö.txt".encode("utf-8")])

And here is why it works:

os.listdir produces a list of Unicode strings if its input is a Unicode string or a file descriptor. But if you pass in a sequence of bytes, the output will also be a list of byte sequences. This is well documented here: https://docs.python.org/3/library/os.html#os.listdir

But there is another difference between these two modes that is not mentioned there:

If the input is a sequence of bytes, Python doesn't care about the encoding of the file system. It always reads the file names as sequences of bytes and appends those sequences to the output list.
But if the input is something else (a Unicode string or a file descriptor), it also reads the bytes first, but then decodes them using the encoding reported by sys.getfilesystemencoding(). Every byte that is not valid in this encoding is replaced by a lone surrogate code point (the surrogateescape error handler: byte 0xNN becomes U+DCNN). That is where the numbers in the question come from: 56524 56456 are U+DCCC U+DC88, i.e. the UTF-8 bytes CC 88 of code point 776, and 56515 56470 are U+DCC3 U+DC96, i.e. the UTF-8 bytes C3 96 of code point 214.
This works well if sys.getfilesystemencoding() reports the correct encoding. (More precisely: it works well if Python guessed the file system encoding correctly. sys.getfilesystemencoding() doesn't make this guess, it only reports its result.) But for a reason I'm still curious about, the guess is wrong if the script is run by Apache as a CGI script. In the setting described here, the real file system encoding is utf-8, but Python believes it is ascii when started from Apache, and so it produces incorrect output.
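
The surrogate replacement can be reproduced without Apache. A minimal sketch, assuming the file names are stored on disk as UTF-8 bytes and that Python decodes them as ASCII with the surrogateescape error handler (the helper names `code_points` and `repair` are made up for this illustration):

```python
def code_points(name):
    """Return the Unicode code points of a string as a list of ints."""
    return [ord(char) for char in name]


def repair(name):
    """Undo a wrong ASCII decoding: recover the raw bytes, then decode as UTF-8."""
    raw = name.encode("ascii", errors="surrogateescape")
    return raw.decode("utf-8")


# Raw UTF-8 bytes of the two file names as they are stored on disk.
decomposed = "O\u0308.txt".encode("utf-8")  # O + combining diaeresis
composed = "\u00d6.txt".encode("utf-8")     # precomposed O with diaeresis

# Decoding as ASCII maps every non-ASCII byte 0xNN to the lone surrogate U+DCNN.
wrong_decomposed = decomposed.decode("ascii", errors="surrogateescape")
wrong_composed = composed.decode("ascii", errors="surrogateescape")

print(code_points(wrong_decomposed))  # [79, 56524, 56456, 46, 116, 120, 116]
print(code_points(wrong_composed))    # [56515, 56470, 46, 116, 120, 116]

# The original bytes are not lost, so the correct string can be recovered.
print(code_points(repair(wrong_decomposed)))  # [79, 776, 46, 116, 120, 116]
print(code_points(repair(wrong_composed)))    # [214, 46, 116, 120, 116]
```

Because the escaped bytes survive the round trip, a name that was decoded with the wrong encoding can still be used to open the file; only its displayed code points are wrong.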

The solution is to use os.listdir in the mode where it doesn't perform any decoding or conversion. And this means: bytes in, bytes out.

To do this, you have to replace

os.listdir("testDir")

by

os.listdir("testDir".encode("utf-8"))

Now os.listdir will work in byte mode, and its output will also be a list of byte sequences. To use them as unicode strings, you just need to decode the byte sequences with this line:

file = file.decode("utf-8")

(The encoding in the last line of my little program ("Ö.txt".encode("utf-8")) was only necessary because my function printFileList can no longer process lists of Unicode strings, only lists of byte sequences.)
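
The bytes-in, bytes-out approach can also be wrapped in a small helper, so that the rest of the program keeps working with ordinary strings. This is only a sketch under the same assumption as the workaround itself, namely that the names on disk really are UTF-8; the function name `list_names_utf8` is made up for this illustration:

```python
import os
import tempfile


def list_names_utf8(directory):
    """List a directory in bytes mode and decode every name as UTF-8."""
    raw_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    # surrogateescape keeps names that are not valid UTF-8 from raising an error.
    return [raw.decode("utf-8", errors="surrogateescape") for raw in raw_names]


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as directory:
    # Create a file whose name is O + combining diaeresis, given as raw UTF-8 bytes.
    path = os.path.join(os.fsencode(directory), b"O\xcc\x88.txt")
    open(path, "wb").close()
    for name in list_names_utf8(directory):
        print([ord(char) for char in name])
```

The decoding happens in exactly one place, so if the real file system encoding ever turns out not to be UTF-8, only this helper has to change.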

But be careful: This is not a solution to the problem. This is only a workaround. If you implement it as described here, it will only work if the actual file system encoding really is utf-8.

I think that the routine within Python that tries to guess the file system encoding has a bug. It makes a wrong guess when Python is started from Apache. A real solution would be to fix this bug.

Another possibility is that some Apache 2 setting makes Python believe it is working on an ASCII-based file system. Maybe you just need to find this setting and correct it, but I have no idea a) whether such an Apache setting really exists, and b) if so, which parameter needs to be set to which value.
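
One way to investigate this possibility is to compare what the script sees in both situations. On POSIX systems Python generally derives the file system encoding from the process locale, and a CGI process may be started without the LANG or LC_ALL variables that a login shell sets. Whether that is what happens here is an assumption; the following sketch may help to check it (the function name `encoding_report` is made up for this illustration):

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings Python typically uses to pick the file system encoding."""
    report = {
        "filesystem encoding": sys.getfilesystemencoding(),
        "preferred encoding": locale.getpreferredencoding(False),
    }
    # Locale-related environment variables; a web server may start CGI scripts without them.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG", "PYTHONUTF8"):
        report[variable] = os.environ.get(variable, "(not set)")
    return report


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

When this runs as a CGI script, the Content-Type header line has to be printed first, as in the test program. If the locale variables show up as not set under Apache, it might be worth trying to pass a UTF-8 locale to the script (for example with the SetEnv directive of Apache's mod_env, assuming that module is enabled) or, on Python 3.7 or later, to set PYTHONUTF8=1.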
