75

I'd like to know how to normalize a URL in Python.

For example, if I have a URL string like: "http://www.example.com/foo goo/bar.html"

I need a library in Python that will transform the extra space (or any other non-normalized character) into a proper URL.

Tom Feiner

9 Answers

76

Have a look at this module: werkzeug.utils (the function now lives in werkzeug.urls).

The function you are looking for is called "url_fix" and works like this:

>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It's implemented in Werkzeug as follows:

import urllib
import urlparse

def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.  This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
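
Since the snippet above is Python 2 (and, per the comments below, url_fix has since been deprecated in Werkzeug), here is a rough Python 3 sketch of the same idea using only urllib.parse (my own port, not the Werkzeug implementation):

from urllib.parse import urlsplit, urlunsplit, quote, quote_plus

def url_fix_py3(s):
    # Split the URL so each component can be quoted with its own rules.
    scheme, netloc, path, qs, anchor = urlsplit(s)
    path = quote(path, '/%')      # keep '/' and existing percent escapes
    qs = quote_plus(qs, ':&=')    # keep query-string delimiters intact
    return urlunsplit((scheme, netloc, path, qs, anchor))

>>> url_fix_py3('http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'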
Armin Ronacher
  • While, going by RFC 2616, this is probably the more accurate solution, I think it's overkill, or am I missing something? – Florian Bösch Sep 23 '08 at 13:40
  • Yes. You probably missed the question. He has a URL from user input and wants to properly convert it into a real URL. (Aka: do what the Firefox location bar does) – Armin Ronacher Sep 23 '08 at 14:14
  • `url_fix` is now located at `werkzeug.urls` – sebpiq Sep 11 '12 at 13:53
  • @ArminRonacher This function is great but unfortunately it does not perform full [syntax-based normalization](https://tools.ietf.org/html/rfc3986?#section-6.2.2), that is to say case normalization + percent-encoding normalisation + path segment normalization, nor [scheme-based normalization](https://tools.ietf.org/html/rfc3986?#section-6.2.3), as defined in RFC 3986. Do you know any Python library (standard or not) that is able to do it? I cannot believe that Python does not have such a basic standard feature. – Géry Ogam Aug 28 '19 at 15:10
  • It was deprecated and will be removed in Werkzeug 2.4. https://github.com/pallets/werkzeug/blob/main/src/werkzeug/urls.py – pymen Apr 17 '23 at 10:33
59

The real fix for this problem in Python 2.7 was:

 from urllib import quote

 # Percent-encode the URL, fixing lame server errors (e.g. a space
 # within URL paths).
 fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

For more information, see Issue 918368: "urllib doesn't correct server returned urls".
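
If you're on Python 3, I believe the equivalent call (my own adaptation, not from the original bug report) uses quote from urllib.parse:

from urllib.parse import quote

fullurl = 'http://www.example.com/foo goo/bar.html'
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")
# -> 'http://www.example.com/foo%20goo/bar.html'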

Oleg Sakharov
  • Excellent answer, concise and helpful. Since this change was inside urllib, code that wishes to do the same should `import urllib` and call `urllib.quote()` with the parameters above. – Quinn Taylor Jan 13 '11 at 05:33
  • This barfs on the letter ä, but I give it my vote because it's simple and doesn't require yet another import. – mlissner Feb 11 '11 at 04:21
24

Use urllib.quote or urllib.quote_plus.

From the urllib documentation:

quote(string[, safe])

Replace special characters in string using the "%xx" escape. Letters, digits, and the characters "_.-" are never quoted. The optional safe parameter specifies additional characters that should not be quoted -- its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

quote_plus(string[, safe])

Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. Unlike quote(), safe does not default to '/'.
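
A quick illustration of the difference (Python 2 shown, to match this answer; in Python 3 both functions live in urllib.parse):

>>> import urllib
>>> urllib.quote('foo bar/baz')
'foo%20bar/baz'
>>> urllib.quote_plus('foo bar/baz')
'foo+bar%2Fbaz'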

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:

>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the URL and encode only the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and simply quote the suspect part of the URL, concatenating it with known safe parts.
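
For instance, a minimal sketch of that last suggestion (the names base and suspect_path are just illustrative, not from the answer):

>>> import urllib
>>> base = 'http://www.example.com/'     # known-safe scheme and host
>>> suspect_path = 'foo goo/bar.html'    # possibly unsafe user-supplied part
>>> base + urllib.quote(suspect_path)
'http://www.example.com/foo%20goo/bar.html'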

Blair Conrad
13

Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.

When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.

I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple of projects. That script doesn't work with the URL given in this question, however, so a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, as well as all of the aforementioned test cases from the Atom wiki.

cobra libre
10

Py3

from urllib.parse import urlparse, urlunparse, quote
def myquote(url):
    parts = urlparse(url)
    return urlunparse(parts._replace(path=quote(parts.path)))

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'

Py2

import urlparse, urllib
def myquote(url):
    parts = urlparse.urlparse(url)
    return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'

This quotes only the path component.
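
A possible extension (my own sketch, not part of the answer above) is to re-quote the query component as well, assuming Python 3's urllib.parse:

from urllib.parse import urlparse, urlunparse, quote, quote_plus

def myquote_with_query(url):
    # Quote the path, and re-quote the query while keeping its '=' and '&' delimiters.
    parts = urlparse(url)
    return urlunparse(parts._replace(
        path=quote(parts.path),
        query=quote_plus(parts.query, safe='=&')))

>>> myquote_with_query('https://www.example.com/~user/with space/index.html?q=a b')
'https://www.example.com/~user/with%20space/index.html?q=a+b'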

tzot
4

Just FYI, urlnorm has moved to github: http://gist.github.com/246089

Mark Nottingham
2

Valid for Python 3.5:

import urllib.parse

urllib.parse.quote(your_url, "\./_-:")

Example:

import urllib.parse

print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", "\./_-:"))

The output will be: http://www.example.com/foo%20goo/bar.html

Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote

Hélder Lima
1

I ran into a similar problem: I needed to quote only the spaces.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]") does help, but it's too complicated.

So I used a simple way: url = url.replace(' ', '%20'). It's not perfect, but it's the simplest way and it works for this situation.

WKPlus
1

A lot of answers here talk about quoting URLs, not about normalizing them.

The best tool to normalize URLs (for deduplication etc.) in Python, IMO, is w3lib's w3lib.url.canonicalize_url util.

Taken from the official docs:

Canonicalize the given url by applying the following procedures:

 - sort query arguments, first by key, then by value
 - percent encode paths; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
 - percent encode query arguments; non-ASCII characters are percent-encoded using the passed encoding (UTF-8 by default)
 - normalize all spaces (in query arguments) to ‘+’ (plus symbol)
 - normalize percent encodings case (%2f -> %2F)
 - remove query arguments with blank values (unless keep_blank_values is True)
 - remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'

I've used this util with great success when broad-crawling the web to avoid duplicate requests caused by minor URL differences (different parameter order, anchors, etc.).
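
As a rough sketch of that deduplication use (assuming w3lib is installed; the URLs and loop are only illustrative):

import w3lib.url

seen = set()
for url in ['http://www.example.com/do?b=2&a=1',
            'http://www.example.com/do?a=1&b=2#frag']:
    canonical = w3lib.url.canonicalize_url(url)
    if canonical not in seen:     # both URLs canonicalize to the same string,
        seen.add(canonical)       # so only the first one is kept
        print(canonical)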

Granitosaurus