394

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title=правовая+защита?

I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.

Karl Knechtel
swordholder
    In the general case, the tail of a URL is just a cookie. You can't know which local character-set encoding the server uses or even whether the URL encodes a string or something completely different. (Granted, many URLs *do* encode a human-readable string; and often, you can guess the encoding very easily. But it's not possible in the general case or completely automatically.) – tripleee Jan 29 '18 at 12:45

5 Answers

625

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which transparently handles both steps: from percent-encoded data to UTF-8 bytes, and then from bytes to text:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')
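
In Python 3, the two steps that unquote() performs can also be written out separately, which may help when the encoding is not known to be UTF-8. The following is a minimal sketch; the choice of 'utf-8' in the second step is an assumption about what the server used, not something the URL itself tells you.

```python
from urllib.parse import unquote, unquote_to_bytes

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

# Step 1: percent-escapes -> raw bytes (no character set assumed yet).
raw = unquote_to_bytes(url)

# Step 2: bytes -> text, using the encoding the server is assumed to use.
text = raw.decode('utf-8')

# unquote() does both steps; encoding and errors are optional parameters.
assert text == unquote(url, encoding='utf-8', errors='replace')
print(text)
```

If the bytes turn out to be in another encoding, only the second step (or the encoding argument) needs to change.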
Keyur Potdar
Martijn Pieters
    So why is the + character left in the string? I thought that %2B was the + character and + literals were removed during decoding? – AlexLordThorsen Oct 02 '14 at 22:29
    @Rawrgulmuffins `+` is a space in [`x-www-form-urlencoded` data](http://en.m.wikipedia.org/wiki/Application/x-www-form-urlencoded#The_application.2Fx-www-form-urlencoded_type); you'd use `urllib.parse.parse_qs()` to parse that, or use `urllib.parse.unquote_plus()`. But they should only appear in the query string, not the rest of the URL. – Martijn Pieters Oct 02 '14 at 23:14
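
To illustrate the point about + in query strings, here is a small sketch using unquote_plus() and parse_qs(). The http:// scheme is added to the example URL as an assumption so that urlsplit() can separate the query string cleanly.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

# Scheme added for illustration; the original example had none.
url = ('http://example.com/?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2'
       '%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0')

# Only the query string uses the form encoding where '+' means space.
query = urlsplit(url).query

# unquote_plus() decodes %XX escapes and also turns '+' into a space.
print(unquote_plus(query))

# parse_qs() does the same and returns a dict mapping keys to value lists.
print(parse_qs(query))
```

Note that parse_qs() returns a dict rather than a string, so it is the better fit when you want individual parameter values.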
165

If you are using Python 3, you can use urllib.parse.unquote:

url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""

import urllib.parse
urllib.parse.unquote(url)

gives:

'example.com?title=правовая+защита'
Karl Knechtel
pavan
  • using this and getting a dict instead of query string on python3.8 – Clocker May 19 '20 at 16:41
  • @Clocker can't reproduce. Make sure to follow the example exactly. If you are having difficulty adapting it to your own needs, ask your own question, making sure to follow the advice in [ask] and [mre]. – Karl Knechtel Jan 10 '23 at 00:45
26

You can get the expected result with the requests library as well:

import requests

url = "http://www.mywebsite.org/Data%20Set.zip"

print(f"Before: {url}")
print(f"After:  {requests.utils.unquote(url)}")

Output:

$ python3 test_url_unquote.py

Before: http://www.mywebsite.org/Data%20Set.zip
After:  http://www.mywebsite.org/Data Set.zip

This might be handy if you are already using requests, since it avoids a separate import for this job.

ivanleoncz
2

URLs embedded in HTML can also contain HTML entities. This replaces them, too:

# from urllib import unquote  # Python 2
from urllib.parse import unquote  # Python 3
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
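
As a smaller illustration of the same idea, here is a sketch with a hypothetical href value (not from the original answer) that contains both an HTML entity and percent-escapes. It undoes the entity layer first, on the assumption that HTML escaping was applied on top of an already-encoded URL.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical href value as it might appear inside HTML source.
href = 'https://example.com/p?a=1&amp;token=x%2Fy%2Bz'

# '&amp;' becomes '&', then '%2F' and '%2B' become '/' and '+'.
decoded = unquote(unescape(href))
print(decoded)
```

Decoding a whole URL this way is fine for display, but escaped characters such as %2F lose their special status, so it is usually safer to split the URL into components before decoding the individual values.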
Roland Puntaier
2

I know this is an old question, but I found it through a Google search and saw that no one had proposed a solution that uses only built-ins, with no imports.

So I quickly wrote my own.

Basically, a URL string can contain only these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =; everything else is percent-encoded.

URL encoding is pretty straightforward: each byte of the character's encoding (UTF-8 here) is written as a percent sign followed by two hexadecimal digits.

So a simple while loop over the characters is enough. If the character is not a percent sign, add its byte as is and advance the index by one. Otherwise, add the byte given by the two hex digits after the percent sign and advance the index by three. Accumulate the bytes and decode them as UTF-8 at the end.

Here is the code:

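def url_parse(url):
    """Percent-decode url, assuming the escaped bytes are UTF-8."""
    data = bytearray()
    i = 0
    while i < len(url):
        if url[i] != '%':
            # Literal character: append its UTF-8 bytes (may be more than one).
            data.extend(url[i].encode('utf8'))
            i += 1
        else:
            # Escape: the two hex digits after '%' give one byte value.
            # A malformed escape raises ValueError here.
            data.append(int(url[i+1:i+3], 16))
            i += 3

    return data.decode('utf8')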

I have tested it and it works perfectly.
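
One way to check a hand-written decoder like this is to compare it against urllib.parse.unquote on a few inputs. The sketch below uses a compact regex-based variant of the same idea; the name percent_decode is hypothetical, and it assumes the escaped bytes are valid UTF-8.

```python
import re
from urllib.parse import unquote


def percent_decode(url):
    """Hypothetical helper: decode %XX escapes, assuming UTF-8 text."""
    raw = url.encode('utf-8')
    # Replace each %XX (two hex digits) with the single byte it denotes.
    raw = re.sub(rb'%([0-9A-Fa-f]{2})',
                 lambda m: bytes([int(m.group(1), 16)]),
                 raw)
    return raw.decode('utf-8')


samples = [
    'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE',
    'http://www.mywebsite.org/Data%20Set.zip',
    '100%',  # a stray percent sign is left untouched
]
for s in samples:
    assert percent_decode(s) == unquote(s)
```

Unlike unquote(), which by default replaces undecodable bytes, this variant raises UnicodeDecodeError when the escaped bytes are not valid UTF-8.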

Ξένη Γήινος