RegEx for extracting domains and subdomains

Question

I'm trying to strip a list of URLs down to their domain names, e.g.:

https://www.facebook.org/hello

becomes facebook.org.

I'm using the following regex pattern:

(https?:\/\/)?([wW]{3}\.)?([\w]*.\w*)([\/\w]*)

This catches most cases but occasionally there will be websites such as:

http://www.xxxx.wordpress.com/hello

which I want to strip to xxxx.wordpress.com.

How can I identify those cases while still identifying all other normal entries?

Use [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) to parse the URL components. — Robert Harvey, May 15 '19 at 21:36
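
The suggestion in this comment could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name `strip_to_domain` is hypothetical, and it assumes that only a leading `www.` label should be dropped from the host.

```python
from urllib.parse import urlsplit


def strip_to_domain(url: str) -> str:
    """Return the host of `url` without a leading "www." (hypothetical helper)."""
    # urlsplit only recognises the host when the URL has a scheme or starts
    # with "//", so add "//" for bare inputs such as "www.example.com/path".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""  # hostname is lower-cased, port removed
    if host.startswith("www."):
        host = host[len("www."):]
    return host


for url in ("https://www.facebook.org/hello",
            "http://www.xxxx.wordpress.com/hello"):
    print(strip_to_domain(url))
```

Because the parser isolates the host before anything is stripped, the path never gets mixed into the result, however many subdomain levels there are.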

score 3 · Accepted Answer · answered May 16 '19 at 03:43

Your expression seems to work fine and outputs what you want. I only added an i flag and slightly modified it to:

(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)

RegEx

If this isn't the expression you wanted, you can modify it on regex101.com.

RegEx Circuit

You can also visualize your expressions on jex.im.

Python Code

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"

test_str = ("https://www.facebook.org/hello\n"
    "http://www.xxxx.wordpress.com/hello\n"
    "http://www.xxxx.yyy.zzz.wordpress.com/hello")

subst = "\\3"

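# count=0 replaces every match; change count to limit the number of replacements.
# count and flags are passed by keyword (positional use is deprecated since Python 3.13).
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)

if result:
    print(result)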

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
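
To see why this pattern copes with multi-level subdomains, it can help to look at the individual matches. The sketch below (the helper name `group3_pieces` is made up for illustration) suggests that a long host is matched in several pieces and that `re.sub` stitches the group 3 pieces back together. Because the `.` inside group 3 is unescaped, a non-word character in the path, such as a hyphen, appears to leak into the result as well.

```python
import re

regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"
flags = re.MULTILINE | re.IGNORECASE


def group3_pieces(url):
    """Hypothetical helper: list what group 3 captured in each match."""
    return [m.group(3) for m in re.finditer(regex, url, flags)]


# The host is consumed by two separate matches, not one.
print(group3_pieces("http://www.xxxx.wordpress.com/hello"))
# ['xxxx.wordpress', '.com']

# A hyphen in the path is picked up by the unescaped "." in group 3.
print(re.sub(regex, r"\3", "https://www.facebook.org/hello-world", flags=flags))
# facebook.org-world
```

If paths with such characters can occur in the input, parsing the URL first or using a stricter pattern may be the safer choice.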

JavaScript Demo

const regex = /(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)/gmi;
const str = `https://www.facebook.org/hello
http://www.xxxx.wordpress.com/hello
http://www.xxxx.yyy.zzz.wordpress.com/hello`;
const subst = `$3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

Although Robert Harvey has suggested the useful urllib.parse approach, here's my attempt at a regex:

(?:http[s]?:\/\/)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?

As seen at regex101.com

Explanation -

First, the regex checks for a leading https:// or http://. If one is present, it is consumed by a non-capturing group, so it is left out of the captured domain.

Then the regex checks for a www. prefix. It's important to note that this is optional (as is the scheme), so if the user enters "my website is site.com", site.com will still be matched.

[^/\n\r\s]+\.[^/\n\r\s]+ matches the actual url you need, so it won't have spaces or newlines. Oh, and there must be at least one period (.) in there.

Since your question looks like you want to match the sub directory as well, I've added (\w+)? at the end.

TL;DR

Group 0 - Entire url

Group 1 - The domain name

Group 2 - The sub-directory
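
The groups listed above could be read out in Python roughly like this. It is only a sketch: the helper name `domain_and_subdir` is hypothetical, and it uses `re.search` so that a URL embedded in a longer sentence is still found.

```python
import re

# Pattern copied from this answer.
pattern = re.compile(
    r"(?:http[s]?:\/\/)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?"
)


def domain_and_subdir(text):
    """Hypothetical helper: return (group 1, group 2), or None if nothing matches."""
    m = pattern.search(text)
    return (m.group(1), m.group(2)) if m else None


print(domain_and_subdir("https://www.facebook.org/hello"))
# ('facebook.org', 'hello')
print(domain_and_subdir("http://www.xxxx.wordpress.com/hello"))
# ('xxxx.wordpress.com', 'hello')
print(domain_and_subdir("my website is site.com"))
# ('site.com', None)
```

Group 2 is `None` when there is no path, so callers should check for that before using it.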

score 0 · Answer 3 · edited Nov 30 '20 at 16:40

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

print("-------------")

regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"
regex1 = r"\.?(microsoft.com.*)"
test_str = (
    "https://blog.microsoft.com/test.html\n"
    "https://www.blog.microsoft.com/test/test\n"
    "https://microsoft.com\n"
    "http://www.blog.xyz.abc.microsoft.com/test/test\n"
    "https://www.microsoft.com")

subst = "\\3"
if test_str:
    print(test_str)

print("-----")
# First pass: strip the scheme, the "www." prefix and the path from each URL.
# Change count to limit the number of replacements (0 means replace all).
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")
# Second pass: remove "microsoft.com" and everything after it on each line,
# leaving only the subdomain part (an empty line when there is none).
result = re.sub(regex1, "", result, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")