Decode escaped characters in URL

Question

I have a list of URLs that contain percent-escaped characters. The escapes were introduced by urllib2.urlopen when it retrieved the HTML page:

http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh

Is there a way to transform them back to their unescaped form in Python?

P.S.: The URLs are encoded in UTF-8.

score 183 · Accepted Answer · edited Feb 01 '22 at 15:46

Use the urllib package (import urllib):

Python 2.7

From the official documentation:

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

Python 3

From the official documentation:

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

[…]

Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
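Applied to one of the URLs from the question, a minimal Python 3 sketch might look like this. The variable names are only illustrative; `unquote` decodes the escapes as UTF-8 by default, which matches what the question states about the encoding.

```python
from urllib.parse import unquote

# One of the URLs from the question (percent-encoded UTF-8)
url = "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit"

# unquote() defaults to encoding='utf-8', so multi-byte sequences
# are reassembled into the original characters
decoded = unquote(url)
print(decoded)
```

In Python 3 the result is already a text string, so no extra decode step is needed.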

edited Feb 01 '22 at 15:46 by Skippy le Grand Gourou

answered Nov 15 '11 at 13:09 by Ignacio Vazquez-Abrams

As I said above, unquote gives sample.com/index.php?title=\xe9\xa6\x96\xe9\xa1\xb5&action=edi ... Maybe I did not explain myself very well, but the URL is a Chinese one and I want to decode it to its original characters, not just the unquoted bytes – Tony Nov 15 '11 at 13:33
@dyoser You need to put this in your question. – Kris Harper Nov 15 '11 at 13:46
@root45 This is a comment on one answer, so it's fine here. Thanks for your remark. – Tony Nov 15 '11 at 13:50
@dyoser My point is that you left out important information in your question. You asked an unclear question and got a downvote and two answers telling you the same thing. If you include the fact that you want to decode Chinese characters, people might be able to help you. Both of these people have correct answers. Your issue is dealing with Unicode. Your strings are UTF-16 which is not going to be handled by default. People may have known that if you included it in your question like I suggested. – Kris Harper Nov 15 '11 at 14:18
No, it's UTF-8. Specifically "首页". – Ignacio Vazquez-Abrams Nov 15 '11 at 14:21
@root45 It's not UTF-16; as Ignacio said, it's UTF-8. And again, maybe the question is unclear, maybe not. – Tony Nov 15 '11 at 18:48
@dyoser I corrected myself in the next comment. Maybe your question is perfectly clear, but given the downvote, the two identical answers and the lack of activity for four hours, I'd say you could benefit from editing your question and adding more information. But hey, it's up to you. I'm not interested in the answer. It's your loss. – Kris Harper Nov 15 '11 at 18:56
Just a note that for python3, this is `urllib.parse.unquote` – tayfun Aug 21 '15 at 11:14
unquote decodes only %xx escape sequences. It doesn't replace the plus sign (+) with an ASCII space ' '. For that, unquote_plus() must be used. – Kuro Aug 24 '15 at 22:45
For python3 it is also in `urllib.request.unquote` – Ben Nov 23 '16 at 07:31
In Python 3, I'm using `urlparse` and then `unquote` to get unquoted versions of the username and password from a URL. But I'm surprised there isn't a one-step way to do it. Have I missed something? – Michael Scheper Oct 22 '18 at 17:12
@Ben This is because `urllib.request` [imports several methods](https://github.com/python/cpython/blob/3.10/Lib/urllib/request.py) from `urllib.parse`. Better import the former if you only need `unquote`, but no need to import it if you already need the latter for something else, I guess. – Skippy le Grand Gourou Feb 01 '22 at 15:54
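The two-step approach mentioned in the comments (split the URL, then unquote the credentials) could be sketched as follows in Python 3. The URL and its credentials are made up for illustration; the sketch relies on the `username` and `password` attributes of the split result still being percent-encoded.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL with percent-encoded credentials (illustration only)
url = "https://user%40example.com:p%40ss%2Fword@host.example:8443/path"

parts = urlsplit(url)

# The attributes are returned as they appear in the URL, so unquote them.
# They are None when the URL carries no credentials, hence the fallback.
username = unquote(parts.username or "")
password = unquote(parts.password or "")

print(username, password, parts.hostname, parts.port)
```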

score 39 · Answer 2 · edited Oct 14 '20 at 19:22

And if you are using Python3 you could use:

import urllib.parse
urllib.parse.unquote(url)

edited Oct 14 '20 at 19:22 by Cadoiz

answered Jan 04 '16 at 15:03 by Vladir Parrado Cruz

Also in `urllib.request.unquote` – Ben Nov 23 '16 at 07:32

score 18 · Answer 3 · answered Dec 10 '15 at 04:27

Or use urllib.unquote_plus, which also converts plus signs to spaces:

>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
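The session above is Python 2. In Python 3 the same functions live in `urllib.parse`; a rough equivalent, reusing the same input string, might be:

```python
from urllib.parse import unquote, unquote_plus

s = 'erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29'

# unquote() only decodes %xx escapes and leaves '+' untouched
plain = unquote(s)

# unquote_plus() additionally turns '+' into a space, as used in
# form-encoded query strings
plus = unquote_plus(s)

print(plain)
print(plus)
```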

score 7 · Answer 4 · answered Nov 15 '11 at 13:09

You can use urllib.unquote

answered Nov 15 '11 at 13:09 by Klaus Byskov Pedersen

When I use unquote (thanks, by the way) it shows this string: http://sample.com/index.php?title=\xe9\xa6\x96\xe9\xa1\xb5&action=edi and I know they're Chinese characters. How can I see them? I guess this is Unicode, right? – Tony Nov 15 '11 at 13:25
That's in your question already. Those are the UTF-8 bytes; you can convert them to a Unicode string with `b"\xe9\xa6\x96\xe9\xa1\xb5".decode("utf-8")` (using somewhat more modern Python syntax now). – tripleee Dec 03 '17 at 13:27
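To make the two steps in that comment explicit (percent-decode to bytes, then decode the bytes as UTF-8), here is a small Python 3 sketch. It uses `urllib.parse.unquote_to_bytes` to reproduce the byte string shown in the comment; the variable names are illustrative.

```python
from urllib.parse import unquote_to_bytes

escaped = "%E9%A6%96%E9%A1%B5"

# Step 1: percent-decode to raw bytes (what Python 2's unquote showed)
raw = unquote_to_bytes(escaped)

# Step 2: interpret those bytes as UTF-8 to get the actual characters
# (in Python 2 the rough equivalent is urllib.unquote(s).decode('utf-8'))
text = raw.decode("utf-8")

print(raw)
print(text)
```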

score 4 · Answer 5 · answered Mar 26 '13 at 00:27

import re

def unquote(url):
    # Replace each %xx escape with the character whose code point is 0xXX
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)

answered Mar 26 '13 at 00:27 by mistercx

Why would you manually use regex and lambdas when there's a built in library that does what you need, probably even more thoughtfully? – Brad Koch Sep 28 '13 at 02:26
Cool solution! `urllib2` is not part of standard python distri. `re` is. – cxxl Nov 11 '14 at 10:25
parsing html with regex isn't usually the best idea. – Jhirschibar Aug 23 '21 at 18:57
This answer is incorrect: it only handles encoded ASCII. If the URL has any encoded non-ASCII characters this will corrupt them, since *each UTF-8 byte* is encoded as a separate % escape. You instead need to do a UTF-8 decode on entire sequences of consecutive escapes, and replace them with the one or more characters that result. – Xanthir Feb 27 '23 at 23:21
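For anyone who still wants a regex-based version, the fix described in the last comment could be sketched like this in Python 3. The name `unquote_utf8` is made up for this example, and `urllib.parse.unquote` remains the simpler choice; the sketch just shows decoding whole runs of escapes as UTF-8 instead of one byte at a time.

```python
import re

# One or more consecutive %xx escapes, matched as a single run
_ESCAPE_RUN = re.compile(r'(?:%[0-9a-fA-F]{2})+')

def unquote_utf8(url):
    """Percent-decode url, treating each run of escapes as UTF-8 bytes."""
    def decode_run(match):
        # "%E9%A6%96" -> "E9A696" -> b"\xe9\xa6\x96" -> one character
        hex_digits = match.group(0).replace('%', '')
        return bytes.fromhex(hex_digits).decode('utf-8', errors='replace')
    return _ESCAPE_RUN.sub(decode_run, url)

print(unquote_utf8("/El%20Ni%C3%B1o/"))
print(unquote_utf8("index.php?title=%E9%A6%96%E9%A1%B5&action=edit"))
```

Invalid byte sequences become U+FFFD here because of `errors='replace'`, the same default the Python 3 `unquote` signature quoted in the accepted answer uses.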
