9

I am using URLs as keys, so I need them to be consistent and clean. I need a Python function that will take a URL and clean it up so that I can do a GET from the database. For example, it will take the following:

example.com
example.com/
http://example.com/
http://example.com
http://example.com?
http://example.com/?
http://example.com//

and output a clean consistent version:

http://example.com/

I looked through the standard library and GitHub and couldn't find anything like this.

Update

I couldn't find a Python library that implements everything discussed here and in the RFC:

http://en.wikipedia.org/wiki/URL_normalization

So I am writing one now. There is a lot more to this than I initially imagined.

nikcub
  • The standardized cleaned-up form should be `http://example.com/` rather than `http://example.com`; an HTTP URL without a path component is technically not well formed. – mu is too short Mar 10 '11 at 16:45
  • You need to define clean. Does it mean an absolute URL? Or a canonical URL? – Wai Yip Tung Mar 10 '11 at 18:29
  • What I really mean is normalize, which is the word that didn't come to me when I typed this up at 5 am. urlparse() looks like what I want; I didn't notice the normalizing aspects of that function when I was reading through the docs early this morning. – nikcub Mar 11 '11 at 06:39
  • See this post: http://stackoverflow.com/questions/5371992/comparing-two-urls-in-python/18690955#18690955 – farmer1992 Sep 12 '13 at 12:26
  • Possible duplicate of [How can I normalize a URL in python](http://stackoverflow.com/questions/120951/how-can-i-normalize-a-url-in-python) – llrs Jun 20 '14 at 15:28

4 Answers

9

It's done in Scrapy:

http://nullege.com/codes/search/scrapy.utils.url.canonicalize_url

Canonicalize the given url by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent encode paths and query arguments. non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
  • normalize all spaces (in query arguments) to '+' (plus symbol)
  • normalize percent encodings case (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)
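
These days the function lives in the w3lib package, which Scrapy depends on (see the comment below). A minimal sketch of calling it directly, assuming w3lib is installed:

from w3lib.url import canonicalize_url

# An empty path is normalized to '/', query arguments are sorted,
# and the fragment is dropped by default.
print(canonicalize_url('http://example.com'))             # http://example.com/
print(canonicalize_url('http://example.com/?b=2&a=1#x'))  # http://example.com/?a=1&b=2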
ducu
  • At least these days Scrapy imports this function from the [w3lib package](http://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.url.canonicalize_url). – Dawn Drescher Dec 23 '16 at 12:22
9

Take a look at urlparse.urlparse(). I've had good success with it.


note: This answer is from 2011 and is specific to Python 2. In Python 3 the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html
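
urlparse()/urlunparse() only split and reassemble a URL; any normalization beyond that is up to you. A rough sketch of the kind of cleanup you can layer on top (Python 3 names; naive_clean is a hypothetical helper, not part of the library):

from urllib.parse import urlsplit, urlunsplit

def naive_clean(url):
    # Hypothetical helper: default the scheme to http, lowercase the
    # scheme and host, default an empty path to '/', and let urlunsplit
    # drop the empty query and fragment.
    if '://' not in url:
        url = 'http://' + url
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or '/', parts.query, parts.fragment))

print(naive_clean('example.com'))          # http://example.com/
print(naive_clean('http://example.com?'))  # http://example.com/

Note that this still leaves a path of `//` untouched, which matches the failure cases described in the comments below.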

samplebias
  • Along with urlparse.urlunparse(). – jd. Mar 10 '11 at 16:20
  • Thanks for that – for some reason I missed the normalization aspect of that function when I was reading the docs early this morning. It took me a few minutes to implement. – nikcub Mar 11 '11 at 06:40
  • Scratch that – the normalization fails on 70%+ of my test cases (I have 50 tests now). For some reason the Python community was against implementing normalization per the RFC and per how browsers handle it: http://en.wikipedia.org/wiki/URL_normalization I found this Python bug: http://bugs.python.org/issue4191 – nikcub Mar 11 '11 at 15:07
  • To add: urlparse's normalization will not make the URLs listed in the question equal to each other, which is what matters here. – nikcub Mar 11 '11 at 15:08
  • The link is dead. – Pedro Lobito Apr 16 '17 at 02:20
  • It doesn't clean certain cases, like `print(urlparse.urlparse('http://example.com//episode/').geturl())`. At least, it didn't clean out the `//` in the URL in Python 2. – Xonshiz Jul 30 '19 at 09:26
1

url-normalize might be what you're looking for.

Depending on your preference, you can also:

  1. remove UTM parameters
  2. remove http(s)://
  3. remove www.
  4. remove trailing /

Here is an example that does this:

from url_normalize import url_normalize
from w3lib.url import url_query_cleaner

urls = [
    'example.com',
    'example.com/',
    'http://example.com/',
    'http://example.com',
    'http://example.com?',
    'http://example.com/?',
    'http://example.com//',
    'http://example.com?utm_source=Google',
]


def canonical_url(u):
    # Normalize per RFC 3986, then strip the common UTM tracking parameters.
    u = url_normalize(u)
    u = url_query_cleaner(
        u,
        parameterlist=['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content'],
        remove=True,
    )

    # Drop the scheme, a leading 'www.', and any trailing slash.
    if u.startswith('http://'):
        u = u[7:]
    if u.startswith('https://'):
        u = u[8:]
    if u.startswith('www.'):
        u = u[4:]
    if u.endswith('/'):
        u = u[:-1]
    return u


list(map(canonical_url, urls))

Which gives this result:

['example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com']

There are still issues with shortened links and redirects of various sorts, but you'd need to make a request to the URL to sort through those.

Antony
-5

Have you considered using regular expressions? They will help you check for malformed URLs. I have used this in one of my applications:

"^[, .a-zA-Z0-9]*$"

Chinmoy