I have a small Python web.py application running on a SUSE Enterprise server. The purpose of the application is to receive a URL-encoded string via HTTP POST, validate the input, build some XML, and then send that XML to another HTTP POST service.
Everything works well except when the URL-encoded input contains Danish characters (and probably other non-ASCII characters as well).
I'm trying to urldecode the string "æøåÆØÅ". Urlencoded the string looks like this: "%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85"
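For reference, each %XX escape stands for one byte of the UTF-8 encoding, so the six characters above take twelve escapes. The following is only a minimal sketch of that relationship: it strips the percent signs and reads the rest as hex, which works here solely because the sample consists of nothing but escapes (a real decoder such as unquote must also handle literal characters). The variable names are my own.

```python
# -*- coding: utf-8 -*-
import binascii

encoded = "%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85"

# Assumption: the input contains only %XX escapes, so dropping the
# percent signs leaves a plain hex dump of the underlying bytes.
raw = binascii.unhexlify(encoded.replace("%", ""))

# The bytes are UTF-8; each Danish letter occupies two of them.
text = raw.decode("utf-8")

print(len(raw), len(text))
```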
To analyze the problem, I created a small sample script that reproduces it.
I've used the trick from the answer here: python url unquote unicode
import urllib2
s1 = "%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85"
print "s1", s1
s2 = urllib2.unquote(s1.encode('ascii'))
print "s2", repr(s2), s2
s3 = s2.decode('utf-8')
print "s3", repr(s3), s3
The problem is that the code works as expected on Windows 7, but on Linux (SUSE), where the application is hosted, the output is garbage.
Output when run in Windows 7:
s1 %C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85
s2 '\xc3\xa6\xc3\xb8\xc3\xa5\xc3\x86\xc3\x98\xc3\x85' ├ª├©├Ñ├å├ÿ├à
s3 u'\xe6\xf8\xe5\xc6\xd8\xc5' æøåÆØÅ
Output when run in SUSE:
s1 %C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85
s2 '\xc3\xa6\xc3\xb8\xc3\xa5\xc3\x86\xc3\x98\xc3\x85' æøåÃÃÃ
s3 u'\xe6\xf8\xe5\xc6\xd8\xc5' æøåÃÃÃ
Apparently \xc5 is rendered as a line break. There is also a line break in the s3 output, but I wasn't able to include an empty line in the code block.
Furthermore, when running this code on SUSE:
import unicodedata
for c in s3:
    print repr(c), unicodedata.name(c)
I get the following:
u'\xe6' LATIN SMALL LETTER AE
u'\xf8' LATIN SMALL LETTER O WITH STROKE
u'\xe5' LATIN SMALL LETTER A WITH RING ABOVE
u'\xc6' LATIN CAPITAL LETTER AE
u'\xd8' LATIN CAPITAL LETTER O WITH STROKE
u'\xc5' LATIN CAPITAL LETTER A WITH RING ABOVE
So it seems that Python interprets the string correctly but cannot display it properly when writing it to the console, a file, or an XML string.
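One check that might narrow this down, assuming the output encoding is where things go wrong, is to compare what Python reports for its output encodings on both machines. This is only a diagnostic sketch: the helper name describe_output_encodings is my own, and the snippet is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import locale
import sys


def describe_output_encodings():
    """Collect the encodings Python may use when writing text."""
    return {
        # May be None when stdout is redirected to a file or pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(),
        "default": sys.getdefaultencoding(),
    }


s3 = u"\xe6\xf8\xe5\xc6\xd8\xc5"

for name, value in sorted(describe_output_encodings().items()):
    print(name, value)

# Encoding explicitly avoids relying on whatever the console reports.
utf8_bytes = s3.encode("utf-8")
print(repr(utf8_bytes))
```

If the values differ between Windows and SUSE, that would point at the environment (terminal or locale settings) rather than at the decoding step itself.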
I'm guessing the problem is the encoding on the Linux server, but I have run out of ideas. Does anyone have any suggestions?