I currently use the function below to clean website URLs. It handles these cases:

http://www.example.com > example.com
https://www.example.com > example.com
http://example.com > example.com

However, a URL without a scheme is left unchanged:

www.example.com > www.example.com

How can I make sure that www.example.com also turns into example.com?

import re

website = "http://www.example.com"
def clean_website(website):
    """
    Transform http://google.com, https://google.com, http://www.google.com and
    https://www.google.com into google.com.
    """
    url = re.compile(r"https?://(www\.)?")
    return url.sub("", website).strip().strip("/")

clean_website(website)
Joey Coder

3 Answers

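Try this. It makes the scheme optional, so a bare www. prefix is stripped as well:

import re

website = "http://www.test.com"
def clean_website(website):
    # Optional http:// or https://, then an optional www., anchored at the start
    r = r"^(?:https?://)?(?:www\.)?"
    return re.sub(r, "", website)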
Negar37

You can use tldextract, which splits a URL into subdomain, domain and suffix:

import tldextract

def clean_website(url):
    # Example of ext if input is http://www.test.com:
    # ExtractResult with subdomain='www', domain='test', suffix='com'

    ext = tldextract.extract(url)

    # Join domain + suffix by attribute name rather than by slicing the result
    return '.'.join(part for part in (ext.domain, ext.suffix) if part)
kaihami

You can use a custom regex pattern as follows:

import re

website = "http://www.test.com"

url = re.compile(r'[a-zA-Z0-9]+\.com') # custom regex pattern; matches .com domains only

print(url.findall(website))

Output for all the examples in your description:

['test.com']

If your domains contain other characters (for example a hyphen), add them inside the [] character class of the pattern.
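
The pattern above is tied to .com. If other suffixes matter, one possible alternative (a different technique from the regex in this answer) is to let the standard library parse the URL. This is only a sketch: it assumes the inputs look like the ones in the question, it discards any path, and it prepends "//" to scheme-less input because urlsplit only recognises the host in that case.

```python
from urllib.parse import urlsplit


def clean_website(website: str) -> str:
    """Return the host of `website` without scheme or a leading 'www.'."""
    website = website.strip()
    # urlsplit only fills in the host when a scheme or a leading '//' is
    # present, so bare input such as 'www.example.com' gets '//' prepended.
    if "://" not in website and not website.startswith("//"):
        website = "//" + website
    host = urlsplit(website).hostname or ""
    # Drop a single leading 'www.' label, if there is one.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com",
        "https://www.example.com",
        "http://example.com",
        "www.example.com",
    ):
        print(candidate, ">", clean_website(candidate))
```

Unlike tldextract, this does not know about public suffixes, so a subdomain other than www (for example blog.example.com) is kept as-is.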

Bikramjeet Singh