Clean www.test.com to test.com

Question

I currently use the following method to clean websites.

http://www.example.com > example.com
https://www.example.com > example.com
http://example.com > example.com

However,

www.example.com > www.example.com

How can I make sure that www.example.com also turns into example.com?

import re

website = "http://www.example.com"
def clean_website(website):
    """
    Transform http://google.com, https://google.com, http://www.google.com and
    https://www.google.com into google.com.
    """
    url = re.compile(r"https?://(www\.)?")
    return url.sub("", website).strip().strip("/")

clean_website(website)
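
The pattern above only matches when a scheme is present, which is why a bare www.example.com comes back unchanged. A minimal sketch of one possible adjustment, assuming the input is a plain host string like the ones listed above: make the scheme optional and anchor the match at the start. The name `clean_website_optional_scheme` is made up for illustration.

```python
import re

# Scheme and "www." are each optional; "^" keeps the match at the start
# of the string so a "www." appearing later is left alone.
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?")

def clean_website_optional_scheme(website):
    """Strip a leading http(s):// and/or www. from a host string."""
    return PREFIX.sub("", website.strip()).strip("/")

for raw in ("http://www.example.com", "https://www.example.com",
            "http://example.com", "www.example.com"):
    print(raw, ">", clean_website_optional_scheme(raw))
```

This still only handles http and https and does not remove paths or ports.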

Please use example.com or *.example in examples, because they are reserved for that [purpose](https://tools.ietf.org/html/rfc2606). – Wouterr, Jan 04 '20 at 16:04

Negar37 · Answer 1 · 2020-01-04T16:19:52.250

2

Try this:

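import re

website = "http://www.test.com"

def clean_website(website):
    # Matches everything from "http" through the last "/", plus the
    # following word characters and an optional dot.
    pattern = r"^http.*/\w*\.?"
    for match in re.findall(pattern, website):
        website = website.replace(match, "")
    return website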

edited Jan 04 '20 at 16:19

answered Jan 04 '20 at 16:03

Negar37


This won't remove other subdomains. – Sayse Jan 04 '20 at 16:07

Now it includes subdomain matching :) – Negar37 Jan 04 '20 at 16:20
Still doesn't support sftp, ports, or any URL that includes a path after the suffix; the duplicate offers much more robust solutions. – Sayse Jan 04 '20 at 16:23

Thank you, Negar37, for the help. I went with `tldextract` in the end. That solved it with the least amount of code. – Joey Coder Jan 04 '20 at 16:28
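
The comments above mention other schemes such as sftp, ports, and paths. As a rough standard-library sketch of how those cases could be covered without a regex, `urllib.parse.urlsplit` can isolate the hostname first. The function name `host_without_www` is hypothetical, and the sketch assumes that only a leading `www.` should be dropped.

```python
from urllib.parse import urlsplit

def host_without_www(url):
    """Return the hostname of url with a single leading 'www.' removed."""
    # urlsplit only recognises the host when "//" precedes it, so add it
    # for scheme-less input such as "www.example.com".
    if "//" not in url:
        url = "//" + url
    # .hostname drops the port, path, and query, and is lower-cased.
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(host_without_www("sftp://www.example.com:22/some/path"))
print(host_without_www("www.example.com"))
```

Unlike `tldextract`, this does not know where the public suffix starts, so other subdomains such as `blog.example.com` are left as they are.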

score 1 · Accepted Answer · answered Jan 04 '20 at 16:03

You can use `tldextract`:

import tldextract

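def clean_website(url):
    # Example for the input http://www.test.com:
    # ExtractResult(subdomain='www', domain='test', suffix='com')
    ext = tldextract.extract(url)

    # Join by field name rather than by position, so the result does not
    # depend on how many fields ExtractResult has.
    return '.'.join(part for part in (ext.domain, ext.suffix) if part)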

score 1 · Answer 3 · answered Jan 04 '20 at 16:08


You can make use of a custom Regex pattern as follows:

import re

website = "http://www.test.com"

url = re.compile(r'[a-zA-Z0-9]+\.com')  # custom regex; "\." matches a literal dot

print(url.findall(website))

Output for all the examples in your description:

['test.com']

Feel free to add any other characters you need inside the [] character class of the pattern.
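
As a sketch of that suggestion, assuming only `.com` hosts matter, the variant below matches a literal `.com`, adds `-` to the character class (hyphens are allowed in domain labels), and rejects longer endings such as `.community`. The name `DOMAIN` is illustrative.

```python
import re

# [a-zA-Z0-9-]+ : one label made of letters, digits, and hyphens
# \.com         : a literal ".com"
# (?![\w-])     : not followed by further name characters
DOMAIN = re.compile(r"[a-zA-Z0-9-]+\.com(?![\w-])")

for raw in ("http://www.example.com", "https://www.example.com",
            "http://example.com", "www.example.com",
            "http://www.my-site.com"):
    print(DOMAIN.findall(raw))
```

Each input yields a one-element list, for example ['example.com'] for the first four and ['my-site.com'] for the last.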

answered Jan 04 '20 at 16:08

Bikramjeet Singh


`[a-zA-Z0-9]+` is a bit too narrow for a domain name. – Toto Jan 04 '20 at 16:32
