python url unquote followed by unicode decode

Question

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it.
I used urllib.unquote_plus(str), but it gives the wrong result.

expected : çöasd+fjkls%asd
result : Ã§Ã¶asd fjkls%asd

The two-byte UTF-8 characters (%C3%A7 and %C3%B6) are decoded incorrectly.
My Python version is 2.7 on a Linux distro. What is the best way to get the expected result?

Please do your attempted helpers a favour and publish the result of executing `import sys; print sys.stdout.encoding` — John Machin, Feb 28 '11 at 10:03
Indeed, the decoding itself is probably working OK, but the reencoding for console display may be having problems. — ncoghlan, Feb 28 '11 at 10:19

John Machin · Accepted Answer · 2013-02-03T20:32:28.487

You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.

Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
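
Both cases come down to the same bytes being read with the wrong codec, and that can be reproduced without relying on any particular console. The following is a minimal sketch, assuming Python 3 (the thread itself uses 2.7), where `urllib.parse.unquote_to_bytes` returns the raw percent-decoded bytes. The variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

quoted = '%C3%A7%C3%B6asd+fjkls%25asd'

# Percent-decoding yields bytes; what they mean depends on the codec applied next.
raw = unquote_to_bytes(quoted)

as_utf8 = raw.decode('utf-8')      # the intended text
as_latin1 = raw.decode('latin-1')  # same bytes misread as one character per byte

# ascii() escapes every non-ASCII character, so the output is console-independent.
print(ascii(raw))
print(ascii(as_utf8))
print(ascii(as_latin1))
```

The latin-1 reading has two characters where the UTF-8 reading has one, which is the "A-tilde" garbage reported in the question.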

If as you say you start off with a unicode object:

>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'

this is an accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:

>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'

then you should unquote it:

>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'

Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:

>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'

and inspect it to see what we've actually got:

>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN

Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd

Note: print implicitly encoded the unicode object using sys.stdout.encoding. Fortunately, all the unicode characters in s3 are representable in that encoding (and in cp1252 and latin1).
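
Whether a string can be displayed at all depends on the output encoding, and that can be checked before printing. The following is a hedged sketch, assuming Python 3; `can_display` is a hypothetical helper written for this illustration, not part of any library.

```python
import sys

def can_display(text, encoding=None):
    """Return True if every character of text is representable in encoding.

    encoding defaults to sys.stdout.encoding, falling back to 'ascii' when
    stdout reports no encoding (for example when it has been replaced).
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

s3 = '\xe7\xf6asd+fjkls%asd'
for codec in ('cp850', 'cp1252', 'latin-1', 'ascii'):
    print(codec, can_display(s3, codec))
```

When the check fails, printing `repr()` (or `ascii()` in Python 3) of the string still shows exactly what it contains.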

I didn't have the same problem as OP but your clear walkthrough of encoding and decoding helped me immediately get working what I haven't been able to from reading quite a bit of documentation. Thank you. — KobeJohn, Jan 07 '13 at 14:57

Duncan · Answer 2 · 2011-02-28T09:06:57.897

12

Using either unquote or unquote_plus will give you a byte string. If you want a Unicode string then you have to decode the byte string to unicode:

>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd').decode('utf8'))
çöasd fjkls%asd
>>>

Compared with:

>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd'))
Ã§Ã¶asd fjkls%asd
>>>

Note that your input string must be a byte string: if you pass unicode to unquote/unquote_plus then you'll get a bit of a mess. If this is the case then encode it first:

>>> print(urllib.unquote_plus(u'%C3%A7%C3%B6asd+fjkls%25asd'.encode('ascii')).decode('utf8'))
çöasd fjkls%asd
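
The encode, unquote, decode sequence can be wrapped in a small function. This is a sketch assuming Python 3, where the byte-level step is `urllib.parse.unquote_to_bytes`; `unquote_plus_to_text` is a hypothetical name, and replacing '+' before percent-decoding is meant to mirror what unquote_plus does.

```python
from urllib.parse import unquote_to_bytes

def unquote_plus_to_text(value, encoding='utf-8'):
    """Percent-decode value (str or bytes), treat '+' as a space, return str."""
    if isinstance(value, str):
        # A quoted URL component should be pure ASCII; fail loudly otherwise.
        value = value.encode('ascii')
    # Replace '+' first so that a literal plus written as %2B survives.
    raw = unquote_to_bytes(value.replace(b'+', b' '))
    return raw.decode(encoding)

print(ascii(unquote_plus_to_text('%C3%A7%C3%B6asd+fjkls%25asd')))
print(ascii(unquote_plus_to_text(b'a%2Bb+c')))
```

Encoding to ASCII up front means a string that was never URL-quoted raises UnicodeEncodeError instead of silently producing garbage.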

edited Feb 28 '11 at 09:06

answered Feb 28 '11 at 09:00

+1 for the `urllib.unquote_plus(u'äö'.encode('ascii')).decode('utf8')` I needed this in Django 1.7 to decode a [file upload name](https://docs.djangoproject.com/en/1.7/ref/files/uploads/). – Larpon Sep 10 '14 at 23:30

score 0 · Answer 3 · answered Feb 28 '11 at 07:33

Try urllib2 once more:

import urllib2
print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd')

Blender

Thanks for your quick reply. I have already tried it and it gave me the same result. Do you have any other suggestion? – user637287 Feb 28 '11 at 07:46

score 0 · Answer 4 · answered Feb 28 '11 at 07:43

'%C3%A7%C3%B6asd+fjkls%25asd' - this is not a unicode string.

This is a url-encoded string. Use urllib2.unquote() instead.

this is the results : `>>> import urllib2 >>> print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd') Ã§Ã¶asd+fjkls%asd` my python version is 2.7 can be the problem because of version differences ? – user637287 Feb 28 '11 at 08:05

score 0 · Answer 5 · answered Feb 28 '11 at 08:14

You have a double problem: your string is a unicode object, and it contains URL-encoded characters. The two interact badly. You can encode the string to ASCII first to be sure it won't be interpreted incorrectly:

>>> import urllib2
>>> s = '%C3%A7%C3%B6asd+fjkls%25asd' # byte string (pure ASCII)
>>> print urllib2.unquote(s) # works as expected on a UTF-8 console
çöasd+fjkls%asd
>>> s = u'%C3%A7%C3%B6asd+fjkls%25asd' # unicode string
>>> print urllib2.unquote(s) # decodes stuff that it shouldn't
Ã§Ã¶asd+fjkls%asd
>>> print urllib2.unquote(s.encode('ascii')) # encode the unicode string to ascii first: works!
çöasd+fjkls%asd

Bite code

I really think something is wrong with my Python version, because I copied your code but the result was 'Ã§Ã¶asd+fjkls%asd' again. Although I have already investigated alternatives, do you know of any other module I can use instead of urllib? – user637287 Feb 28 '11 at 08:36
The problem is unlikely to be Python. But to be honest, I'm running out of rational explanations :-) Did you try voodoo? Have you tried it directly in the Python shell? If not, you may want to declare the encoding of your file at the top of it. What is your OS? I'm guessing Windows, since it has a lot of encoding issues. – Bite code Feb 28 '11 at 09:58
voodoo? a little outdated; try a tambourine (http://www.elcomsoft.com/tambourine.html?r1=pr&r2=april1) or (much better) the `repr()` built-in function. – John Machin Feb 28 '11 at 12:03

score -1 · Answer 6 · answered Feb 28 '11 at 07:50

You are using the unquote_plus method, which also converts each + to a space. Just use the unquote method and you should be fine.

>>> import urllib
>>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd+fjkls%asd
>>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd fjkls%asd
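
The handling of '+' is the only difference between the two functions. Here is a short sketch, assuming Python 3, where both live in `urllib.parse` and decode the percent-escapes as UTF-8 by default; the variable names are illustrative.

```python
from urllib.parse import unquote, unquote_plus

quoted = '%C3%A7%C3%B6asd+fjkls%25asd'

plain = unquote(quoted)            # leaves '+' untouched
with_plus = unquote_plus(quoted)   # additionally turns '+' into a space

# ascii() keeps the comparison independent of the console encoding.
print(ascii(plain))
print(ascii(with_plus))
```

Neither choice affects how %C3%A7 and %C3%B6 are decoded, so switching between them does not by itself cure garbled output.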

Senthil Kumaran

actually, what I expected is the second output but i am doing exactly the same thing and here is my results; `>>> import urllib >>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd') Ã§Ã¶asd+fjkls%asd >>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd') Ã§Ã¶asd fjkls%asd` – user637287 Feb 28 '11 at 08:00
Encode your string to ascii (`s.encode('ascii')`) and then use unquote_plus. That should do it. – Senthil Kumaran Feb 28 '11 at 09:13
