On my Ubuntu server I have a directory that contains these two files:
testDir# ls -als
insgesamt 12
4 drwxr-xr-x 2 root root 4096 Mai 29 15:12 .
4 drwxr-xr-x 6 root root 4096 Mai 28 18:38 ..
0 -rw-r--r-- 1 root root 0 Mai 28 19:17 Ö.txt
4 -rw-r--r-- 1 root root 9 Mai 28 19:16 Ö.txt
The file names look the same, but they are not. The file with size 0 has one character before the dot (Unicode code point 214 = Ö); the other file (size 9) has two characters (code point 79 = O followed by 776 = ¨, a combining character that modifies the character before it). To display the Unicode code points, I wrote this little script:
#!/usr/bin/env python3
import os
def printFileList(fileList):
    for file in fileList:
        string = ""
        for char in file:
            string += str(ord(char)) + " "
        string += "<br>"
        print(string)
print("Content-Type: text/html\n")
printFileList(os.listdir("testDir"))
printFileList(["Ö.txt", "Ö.txt"])
As you can see, I first read the file names from the operating system and display the code points of their characters. Then I do the same with strings that are hard-coded in the program.
When I run this program from the shell, I get this result:
testDir# ./test.py
Content-Type: text/html
79 776 46 116 120 116 <br>
214 46 116 120 116 <br>
79 776 46 116 120 116 <br>
214 46 116 120 116 <br>
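The two patterns in this output (79 776 versus 214) are the decomposed and precomposed spellings of the same letter. As a side sketch, separate from test.py, the standard `unicodedata` module can convert between them. The variable names and the helper `code_points` are only illustrative and do not come from the script above.

```python
import unicodedata

precomposed = "\u00d6"   # one code point: 214 (LATIN CAPITAL LETTER O WITH DIAERESIS)
decomposed = "O\u0308"   # two code points: 79 (O) and 776 (COMBINING DIAERESIS)

def code_points(text):
    """Return the list of Unicode code points of a string."""
    return [ord(ch) for ch in text]

print(code_points(precomposed))    # [214]
print(code_points(decomposed))     # [79, 776]
print(precomposed == decomposed)   # False: the strings differ although they look alike

# NFC composes, NFD decomposes; after normalization the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

Normalizing both sides to the same form is only useful for comparing names; on this file system the two names remain distinct files.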
But test.py (to be more precise: a more advanced version of it) is meant to run as a CGI script from a web server. My web server is Apache 2, and when I call the script from a browser, I get this result:
79 56524 56456 46 116 120 116
56515 56470 46 116 120 116
79 776 46 116 120 116
214 46 116 120 116
The string Content-Type: text/html is part of the HTTP protocol and is not displayed, and <br> is rendered as a line break, so these parts are not visible in a browser, for good reasons. But look at the numbers!
What should be 776 is 56524 56456 in the first line, and in the second line 214 became 56515 56470. But this happened only for the file names read from the operating system. The hard-coded strings are correct.
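A possible reading of these numbers (an assumption, not something the output above proves): each unexpected value is 56320 (0xDC00) plus one byte of the name's UTF-8 encoding, which is what Python's `surrogateescape` error handler produces when bytes that are not ASCII are decoded as ASCII. The sketch below reproduces the numbers under that assumption; the helper name `code_points_as_ascii` is made up for illustration.

```python
# UTF-8 bytes of the two file names, written with escapes to keep them unambiguous.
decomposed_bytes = "O\u0308.txt".encode("utf-8")    # b'O\xcc\x88.txt'
precomposed_bytes = "\u00d6.txt".encode("utf-8")    # b'\xc3\x96.txt'

def code_points_as_ascii(raw):
    """Decode bytes as ASCII with surrogateescape and return the code points.

    Every byte >= 0x80 cannot be ASCII, so it is mapped to the lone
    surrogate 0xDC00 + byte instead of raising an error.
    """
    return [ord(ch) for ch in raw.decode("ascii", errors="surrogateescape")]

print(code_points_as_ascii(decomposed_bytes))
# [79, 56524, 56456, 46, 116, 120, 116]
print(code_points_as_ascii(precomposed_bytes))
# [56515, 56470, 46, 116, 120, 116]
```

For example, 56524 - 56320 = 204 = 0xCC and 56456 - 56320 = 136 = 0x88, and 0xCC 0x88 is the UTF-8 encoding of code point 776.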
My questions:
1) What causes this strange behavior?
2) What has to be changed so that the correct code points (776 and 214) are shown?
Addendum
I added these lines to my program:
import sys
print(sys.getfilesystemencoding())
The output of this line depends on how the script is started:
When run from the shell: utf-8, which is correct.
When run from Apache as a CGI script: ascii, which is wrong.
So my new question is:
How can I tell my script that it should always use utf-8 as the file system encoding?
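One workaround that might sidestep the question, sketched here under the assumption that the names on disk really are UTF-8: ask `os.listdir` for bytes instead of strings and decode them explicitly, so the result does not depend on the file system encoding the process starts with. The function name `list_names_utf8` is hypothetical and not part of the original script.

```python
import os

def list_names_utf8(directory):
    """List a directory and decode each name as UTF-8, ignoring the
    process-wide file system encoding.

    Passing a bytes path makes os.listdir return the raw bytes of each name.
    Assumption: the names are stored as UTF-8; bytes that are not valid
    UTF-8 are replaced with U+FFFD rather than raising an error.
    """
    raw_names = os.listdir(os.fsencode(directory))
    return [raw.decode("utf-8", errors="replace") for raw in raw_names]
```

In the script above, `printFileList(list_names_utf8("testDir"))` would then take the place of `printFileList(os.listdir("testDir"))`. This does not change what `sys.getfilesystemencoding()` reports; it only avoids relying on it for this one listing.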