Urldecoding string with Danish characters on Linux

Question

I have a small Python web.py application running on a SUSE Enterprise server. The application receives a URL-encoded string via HTTP POST, validates the input, builds some XML and then sends that XML to another HTTP POST service.

Everything works great, except when the urlencoded input contains any Danish characters and probably other special characters as well.

I'm trying to URL-decode the string "æøåÆØÅ". URL-encoded, the string looks like this: "%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85"
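As a quick sanity check of that encoded form, here is a minimal sketch in Python 3 syntax (an assumption on my part: the sample app in this question is Python 2, and `urllib.parse` is the Python 3 counterpart of the functions it uses). It round-trips the string without printing any non-ASCII text, so the result does not depend on the terminal:

```python
from urllib.parse import quote, unquote

original = "æøåÆØÅ"

# quote() encodes the text as UTF-8 by default and percent-escapes each byte.
encoded = quote(original)
print(encoded)

# unquote() reverses this, decoding the escaped bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded == original)
```

Each Danish letter takes two bytes in UTF-8, which is why six characters turn into twelve `%XX` escapes.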

To analyze the problem I've created a small sample app that reproduces it.

I've used the trick from the answer here: python url unquote unicode

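# Python 2 sample
import urllib2

# URL-encoded UTF-8 bytes for "æøåÆØÅ"
s1 = "%C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85"
print "s1", s1

# Percent-decode into a byte string (still UTF-8 encoded)
s2 = urllib2.unquote(s1.encode('ascii'))
print "s2", repr(s2), s2

# Decode the UTF-8 bytes into a unicode object
s3 = s2.decode('utf-8')
print "s3", repr(s3), s3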

The problem is that the code works as expected on Windows 7, but on Linux (SUSE), where the application is hosted, the output is garbage.

Output when run in Windows 7:

s1 %C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85
s2 '\xc3\xa6\xc3\xb8\xc3\xa5\xc3\x86\xc3\x98\xc3\x85' ├ª├©├Ñ├å├ÿ├à
s3 u'\xe6\xf8\xe5\xc6\xd8\xc5' æøåÆØÅ

Output when run in SUSE:

s1 %C3%A6%C3%B8%C3%A5%C3%86%C3%98%C3%85
s2 '\xc3\xa6\xc3\xb8\xc3\xa5\xc3\x86\xc3\x98\xc3\x85' Ã¦Ã¸Ã¥ÃÃÃ

s3 u'\xe6\xf8\xe5\xc6\xd8\xc5' Ã¦Ã¸Ã¥ÃÃÃ

Apparently the \xc5 becomes a line break. There is also a line break in s3 but I wasn't able to have an empty line in the code tag.
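The stray line break is consistent with UTF-8 bytes being read as Latin-1: the second byte of "Å" (`\xc3\x85`) is `0x85`, which Latin-1 maps to U+0085, the NEXT LINE control character. The following Python 3 sketch is an illustration of that effect, not a confirmed diagnosis of the SUSE terminal. It uses `ascii()` so the output is safe to print on any console:

```python
text = "æøåÆØÅ"

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1),
# which is effectively what a terminal set to Latin-1 does with the bytes.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# ascii() escapes every non-ASCII character, so nothing here depends on the console.
print(ascii(mojibake))

# The final character is U+0085; str.splitlines() treats it as a line boundary.
print(ascii(mojibake[-1]))
lines = (mojibake + "next").splitlines()
print(len(lines))
```

The two uppercase letters before it end in `0x86` and `0x98`, which are also control characters in Latin-1, which would explain why they show up as a bare "Ã" in the SUSE output.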

Furthermore, when running this code on SUSE:

import unicodedata

for c in s3:
    print repr(c), unicodedata.name(c)

I get the following:

u'\xe6' LATIN SMALL LETTER AE 
u'\xf8' LATIN SMALL LETTER O WITH STROKE 
u'\xe5' LATIN SMALL LETTER A WITH RING ABOVE 
u'\xc6' LATIN CAPITAL LETTER AE 
u'\xd8' LATIN CAPITAL LETTER O WITH STROKE 
u'\xc5' LATIN CAPITAL LETTER A WITH RING ABOVE

So it seems that Python interprets the string correctly, but the result is not displayed properly when the string is printed to the console, written to a file, or embedded in the XML string.

I'm guessing the problem is the encoding on the Linux server, but I have run out of ideas. Does anyone have any suggestions?
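One way to narrow this down might be to compare what Python believes the output encoding is on each machine. A small diagnostic sketch in Python 3 syntax follows; the helper name `describe_encodings` is made up for illustration, and the set of environment variables checked is only a reasonable starting assumption:

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings that influence printed output."""
    return {
        "stdout_encoding": sys.stdout.encoding,
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


for key, value in describe_encodings().items():
    print(key, "=", value)
```

If these all report UTF-8 while the console still shows garbage, the mismatch is more likely in the terminal emulator's own character set setting than in Python or the server locale.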

Aarg, the paste went haywire. This is a dupe of [Url decode UTF-8 in Python](http://stackoverflow.com/q/16566069) - Martijn Pieters, Jan 28 '14 at 12:40
You have UTF-8 data encoded as a URL; `urllib.unquote(s1).decode('utf8')` produces the right output, but your terminal appears to be misconfigured: it is receiving UTF-8 data from the print but is *interpreting* the data as Latin 1. Python is working as intended. - Martijn Pieters, Jan 28 '14 at 12:43
Thanks for the replies. It doesn't solve my problem, but it confirms my suspicion that the problem lies more with the Linux charset/encoding than with Python. - Anders Thrane Michelsen, Jan 28 '14 at 13:09
