Url decode UTF-8 in Python

Question

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title=правовая+защита?

I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.

In the general case, the tail of a URL is just a cookie. You can't know which local character-set encoding the server uses or even whether the URL encodes a string or something completely different. (Granted, many URLs *do* encode a human-readable string; and often, you can guess the encoding very easily. But it's not possible in the general case or completely automatically.) – tripleee, Jan 29 '18 at 12:45
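To illustrate the point in the comment above, here is a minimal Python 3 sketch. It assumes a hypothetical server that percent-encodes with Windows-1251 rather than UTF-8; the sample word and the choice of charset are illustrative only.

```python
from urllib.parse import quote, unquote

text = 'право'

# Hypothetical server that percent-encodes with Windows-1251 instead of UTF-8.
escaped = quote(text, encoding='cp1251')

# Decoding with the matching charset recovers the original text.
as_cp1251 = unquote(escaped, encoding='cp1251')

# The default (UTF-8 with errors='replace') cannot recover it and
# substitutes U+FFFD replacement characters for the undecodable bytes.
as_utf8 = unquote(escaped)

print(escaped)
print(as_cp1251)
print(as_utf8)
```

Nothing in the escaped string itself says which charset was used, so the caller has to know or guess it.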

score 625 · Accepted Answer · edited May 05 '19 at 16:40

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which transparently handles both steps: from percent-encoded data to UTF-8 bytes, and then from bytes to text:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
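For readers who want to see the two steps separately, the following Python 3 sketch uses urllib.parse.unquote_to_bytes() for the percent-decoding and an explicit decode for the text conversion. The variable names are illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

# Step 1: percent-escapes -> raw bytes (still UTF-8 encoded).
raw = unquote_to_bytes(url)

# Step 2: UTF-8 bytes -> text.
text = raw.decode('utf-8')

print(text)
print(text == unquote(url))
```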

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')

edited May 05 '19 at 16:40

Keyur Potdar

answered May 15 '13 at 13:19

Martijn Pieters

So why is the + character left in the string? I thought that %2B was the + character and + literals were removed during decoding? – AlexLordThorsen Oct 02 '14 at 22:29

@Rawrgulmuffins `+` is a space in [`x-www-form-urlencoded` data](http://en.m.wikipedia.org/wiki/Application/x-www-form-urlencoded#The_application.2Fx-www-form-urlencoded_type); you'd use `urllib.parse.parse_qs()` to parse that, or use `urllib.parse.unquote_plus()`. But they should only appear in the query string, not the rest of the URL. – Martijn Pieters Oct 02 '14 at 23:14
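A short Python 3 sketch of the two options mentioned in the comment above. The http:// scheme and trailing slash are added here only to make the example a complete URL; they are not part of the original question.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

# Assumed full URL; the question's example has no scheme.
url = ('http://example.com/?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2'
       '%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0')

# Isolate the query string, since '+' only means "space" there.
query = urlsplit(url).query

# parse_qs() returns a dict mapping each key to a list of decoded values.
params = parse_qs(query)

# unquote_plus() decodes the whole string, turning '+' into a space.
plain = unquote_plus(query)

# A literal plus sign has to be sent as %2B to survive decoding.
mixed = unquote_plus('a%2Bb+c')

print(params)
print(plain)
print(mixed)
```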

score 165 · Answer 2 · edited Jan 10 '23 at 00:46

If you are using Python 3, you can use urllib.parse.unquote:

url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""

import urllib.parse
urllib.parse.unquote(url)

gives:

'example.com?title=правовая+защита'

edited Jan 10 '23 at 00:46

Karl Knechtel

answered Sep 08 '15 at 07:42

pavan

Using this, I get a dict instead of a query string on Python 3.8. – Clocker May 19 '20 at 16:41
@Clocker can't reproduce. Make sure to follow the example exactly. If you are having difficulty adapting it to your own needs, ask your own question, making sure to follow the advice in [ask] and [mre]. – Karl Knechtel Jan 10 '23 at 00:45

score 26 · Answer 3 · answered Aug 10 '20 at 23:19

You can achieve the expected result with the requests library as well:

import requests

url = "http://www.mywebsite.org/Data%20Set.zip"

print(f"Before: {url}")
print(f"After:  {requests.utils.unquote(url)}")

Output:

$ python3 test_url_unquote.py

Before: http://www.mywebsite.org/Data%20Set.zip
After:  http://www.mywebsite.org/Data Set.zip

This might be handy if you are already using requests, since you don't need to import another library for the job.

answered Aug 10 '20 at 23:19

ivanleoncz

Works with Python 2 too. – lfurini Feb 24 '21 at 11:06
This is just an [alias](https://github.com/psf/requests/blob/79f2ec3acc4e24fef6e6ce31ad7b1d4e2f77be31/requests/compat.py#L57) for `urllib.parse`. – bfontaine May 15 '22 at 16:59
This comment of yours added a lot. Thank you very much. – ivanleoncz May 17 '22 at 17:08
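Following the comment above that requests.utils.unquote is an alias for the urllib.parse function, the same result should be obtainable from the standard library alone. A minimal Python 3 sketch with the URL from the answer:

```python
from urllib.parse import unquote

url = "http://www.mywebsite.org/Data%20Set.zip"

# Standard-library call that requests re-exports as requests.utils.unquote.
decoded = unquote(url)

print(f"Before: {url}")
print(f"After:  {decoded}")
```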

score 2 · Answer 4 · edited Sep 26 '21 at 10:04

In HTML the URLs can contain html entities. This replaces them, too.

# Python 2: from urllib import unquote
from urllib.parse import unquote
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&amp;confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
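When the goal is to read individual query parameters from a link copied out of HTML source, one possible variation is to unescape the HTML entities first and only then parse the query string. The sketch below is for Python 3; the href value and parameter names are hypothetical.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit

# Hypothetical href value as it might appear inside HTML source.
href = 'https://example.org/confirm?userId=42&amp;token=a%2Fb%2Bc'

# Undo the HTML layer first: '&amp;' becomes the '&' parameter separator.
url = unescape(href)

# Then split the query string and percent-decode each value.
params = parse_qs(urlsplit(url).query)

print(url)
print(params)
```

Doing the HTML step first keeps a percent-encoded value such as %26amp%3B from being mistaken for an entity after URL decoding.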

edited Sep 26 '21 at 10:04

answered Aug 11 '21 at 11:24

Roland Puntaier

`html.unescape` is unnecessary. – pylover Sep 23 '21 at 16:35
Without `unescape` on my computer `&amp;` in the example is not converted to `&`. I just checked with Python 3.9.7. – Roland Puntaier Sep 26 '21 at 10:06
The question is about decoding URLs, not HTML. – bfontaine May 15 '22 at 16:57

score 2 · Answer 5 · answered Sep 20 '22 at 05:51

I know this is an old question, but I stumbled upon this via Google search and found that no one has proposed a solution with only built-in features.

So I quickly wrote my own.

A URL string can only contain these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =. Everything else is percent-encoded.

URL encoding is straightforward: each disallowed character is replaced by the bytes of its encoding (UTF-8 here), and each byte is written as a percent sign followed by two hexadecimal digits.

So a simple while loop over the characters is enough. If the current character is not a percent sign, append it as is and advance the index by one; otherwise, append the byte given by the two hex digits after the percent sign and advance the index by three. Decoding the accumulated bytes as UTF-8 then gives the text.

Here is the code:

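def url_parse(url):
    """Decode %XX escapes in url, interpreting the resulting bytes as UTF-8."""
    data = bytearray()
    i = 0
    while i < len(url):
        if url[i] != '%':
            # Literal character: append its UTF-8 bytes unchanged.
            # (ord() would fail here for code points above 255.)
            data.extend(url[i].encode('utf8'))
            i += 1
        else:
            # '%XX' escape: the two hex digits after '%' give one byte.
            data.append(int(url[i+1:i+3], 16))
            i += 3

    return data.decode('utf8')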

I have tested it and it works perfectly.
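As a cross-check of the approach described in this answer, here is a regex-based variant compared against urllib.parse.unquote() on a few inputs (Python 3). The helper name percent_decode and the sample strings are hypothetical, and this is a sketch rather than a replacement for the standard-library function.

```python
import re
from urllib.parse import unquote

# Matches one well-formed escape: '%' followed by exactly two hex digits.
ESCAPE = re.compile(rb'%([0-9A-Fa-f]{2})')


def percent_decode(text):
    """Hypothetical helper: decode %XX escapes, reading the bytes as UTF-8."""
    raw = text.encode('utf-8')
    # Replace each escape with the single byte it denotes.
    decoded = ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode('utf-8')


samples = [
    'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE',
    'Data%20Set.zip',
    'no-escapes-here',
    '100%',  # malformed escape: left untouched by both functions
]

results = [(s, percent_decode(s), unquote(s)) for s in samples]
for source, mine, stdlib in results:
    status = 'matches stdlib' if mine == stdlib else 'differs'
    print(f'{source!r} -> {mine!r} ({status})')
```

One difference to keep in mind: this helper raises UnicodeDecodeError when the decoded bytes are not valid UTF-8, whereas unquote() defaults to errors='replace'.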
