Remove every character before the website name in a URL

Question

For example, if I have https://stackoverflow.com/questions/ask, I'd like to cut it to stackoverflow.com/questions/ask, and if I have http://www.samsung.com/au/, I'd like to cut it to samsung.com/au/.

I want to make a template tag (filter) for this, but I'm not sure what to return:

def clean_url(url):
    return ?

Template:

{{ url|clean_url }}

Any idea?

On advertised posts on my site I want to show the site the post links to - but I want it to look clean without the `https` or `www` etc — Zorgan, Apr 14 '18 at 00:44
See [here](https://stackoverflow.com/a/286194/1081569), but in Python 3 it's `from urllib.parse import urlparse`. — Paulo Almeida, Apr 14 '18 at 00:50
@ivan_pozdeev, that was uncalled for. Not everyone has your thought process when it comes to typing what to search for. — gahooa, Apr 14 '18 at 00:53
Possible duplicate of [How to split a web address](https://stackoverflow.com/questions/286150/how-to-split-a-web-address) — Paulo Almeida, Apr 14 '18 at 00:54
@gahooa https://meta.stackoverflow.com/questions/355550/what-should-i-do-with-a-question-that-is-too-simple — ivan_pozdeev, Apr 14 '18 at 00:58

score 2 · Answer 1 · answered Apr 14 '18 at 00:52

Here is a quick and dirty way to isolate the domain, provided the URL starts with a scheme followed by // (for example https://). Note that it returns only the domain and drops the path:

def clean(url):
  return url.partition('//')[2].partition('/')[0]
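
Since the question asks to keep the path, a minimal variant of the same partition idea might look like the sketch below. The name strip_scheme is hypothetical, and the sketch assumes the first // in the string, if any, belongs to the scheme.

```python
def strip_scheme(url):
    # str.partition returns (before, separator, after). When '//' is
    # absent the separator is empty, so the URL is returned unchanged.
    before, sep, after = url.partition('//')
    return after if sep else url


print(strip_scheme('https://stackoverflow.com/questions/ask'))
# stackoverflow.com/questions/ask
```

This leaves a leading www. in place, so http://www.samsung.com/au/ becomes www.samsung.com/au/.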

gahooa

Or just use the quick and clean `urllib.parse.urlparse(url).netloc` :) He wanted to keep the path though, not just the domain. – Paulo Almeida Apr 14 '18 at 01:05

codingatty · Answer 2 · 2018-04-14T01:11:03.133

urllib.parse will do most of this for you:

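import urllib.parse

def clean_url(url):
    # Split into [scheme, netloc, path, query, fragment] and blank out the scheme.
    parts = list(urllib.parse.urlsplit(url))
    parts[0] = ""
    cleaned = urllib.parse.urlunsplit(parts)
    # With the scheme removed, a URL that has a host is rebuilt with a leading "//".
    if cleaned.startswith("//"):
        cleaned = cleaned[2:]
    return cleaned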

Note this does not cut off the "www.", but you shouldn't do that; that can be a critical part of the domain name. If you really want that, add:

if cleaned.startswith("www."):
    cleaned = cleaned[4:]
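
The two snippets above could be combined into one function where stripping "www." is opt-in. This is only a sketch: the strip_www parameter is an assumption added for illustration, not part of the question.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url, strip_www=False):
    parts = urlsplit(url)
    # Rebuild the URL with an empty scheme; when a host is present
    # the result starts with '//', which is removed below.
    cleaned = urlunsplit(('', parts.netloc, parts.path, parts.query, parts.fragment))
    if cleaned.startswith('//'):
        cleaned = cleaned[2:]
    # Hypothetical opt-in flag: only drop 'www.' when explicitly asked to.
    if strip_www and cleaned.startswith('www.'):
        cleaned = cleaned[4:]
    return cleaned


print(clean_url('https://stackoverflow.com/questions/ask'))
# stackoverflow.com/questions/ask
print(clean_url('http://www.samsung.com/au/', strip_www=True))
# samsung.com/au/
```

Keeping the default at False preserves the subdomain unless the caller decides it is safe to drop.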

Zev · Answer 3 · 2018-04-14T01:04:21.250

For the use cases you described, you can just split on the double forward slash and go with that, or work from there.

def clean_url(url):
    # Keep everything after the first '//'; leave the URL unchanged if there is none.
    clean = url.split('//', 1)[-1]
    if clean.startswith('www.'):
        return clean[4:]
    return clean

However, because the subdomain (such as 'www') can be a significant part of the URL, you may want to keep it in. For example, www.pizza.com and pizza.com could be links to different pages.

Other things to consider are the urllib.parse module (urlparse in Python 2) or a regex, but they may be overkill for this.
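
For comparison, a regex version might look like the sketch below. The pattern and the name clean_url_regex are assumptions for illustration; it strips an optional scheme plus // and an optional leading www., and leaves anything else untouched.

```python
import re

# Optional "scheme://" (or a bare "//"), followed by an optional "www."
_PREFIX = re.compile(r'^(?:(?:[a-z][a-z0-9+.-]*:)?//)?(?:www\.)?', re.IGNORECASE)


def clean_url_regex(url):
    # Remove the matched prefix once, at the start of the string only.
    return _PREFIX.sub('', url, count=1)


print(clean_url_regex('http://www.samsung.com/au/'))
# samsung.com/au/
```

The same caveat applies as above: dropping www. can change which page the link points to.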
