Building a regex to extract domains ONLY

Question

I am looking to create a regex in Python to extract ONLY the domains from the set of URLs at the bottom of this post. I have been using https://regexr.com/ to test my regex before applying Series.str.extract(). So far I have gotten VERY close, but the first character of the domain (the first 'w' in www, where there is one) is not being captured. The regex I have so far is this:

[^\/\/](\w*.\w*.com|\w*.\w*.org|\w*.\w*.cc|\w*.\w*.ly)

How can I modify this to go from http://css-cursor.techstream.org to only css-cursor.techstream.org?

'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'

Chris Wesseling · Answer 1 · 2021-04-22T00:31:13.740

Is the regex a hard requirement, for example because you need to combine it with an existing regex? If not, there's an easy tool in the standard library that does it:

from urllib.parse import urlparse

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

domains = [urlparse(url).netloc for url in urls]
print(domains)

Well I guess the regex is faster:

>>> import re
>>> netloc = re.compile(r'^https?://([^/?^]+)', flags=re.I)
>>> %timeit [netloc.match(url).group(1) for url in urls]
5.66 µs ± 97.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit [urlparse(url).netloc for url in urls]
23.3 µs ± 3.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Hi! I was asked to do this as part of an exercise to practice working with a regex. I appreciate you showing me urlparse! Currently working on figuring out what a non-capturing group is. – Willy_Golden Apr 22 '21 at 12:15
@Willy_Golden Have fun learning a powerful tool. If you are having trouble getting to grips with them, you could try a different point of view and practise turning a given regex into a finite state automaton. I find that approaching a subject from different angles helps form the right intuition. And formal language theory [can be very amusing](https://stackoverflow.com/a/1732454/383793) – Chris Wesseling Apr 23 '21 at 09:01

Yam Mesicka · Answer 2 · 2021-04-22T16:38:02.343

I've changed it a bit to the following expression:

[^\/\/]([\w\-.]*\.(?:org|com|cc|ly))

The . before the TLD is now escaped with a backslash (\.), so it matches a literal . rather than "any character".
I've added - and . to the characters allowed in the host name (not only \w).
I've grouped the TLDs (org, com, cc, ly) into a non-capturing group ((?:...)), just to make the regular expression look cleaner and eliminate repetition.
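
As a quick check of the expression above, here is a minimal sketch using Python's `re` module. It suggests that the leading `[^\/\/]` still consumes one character, which is why the first letter of the host goes missing from the capture group. The lookbehind variant is an assumption of this sketch, not part of the answer; it asserts the `//` without consuming anything.

```python
import re

url = 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429'

# Pattern from the answer: [^\/\/] has to consume one non-slash character
# before the capture group can start.
consuming = re.compile(r'[^\/\/]([\w\-.]*\.(?:org|com|cc|ly))')

# Assumed variant: a lookbehind checks for '//' but consumes nothing.
lookbehind = re.compile(r'(?<=//)([\w\-.]*\.(?:org|com|cc|ly))')

dropped = consuming.search(url).group(1)
kept = lookbehind.search(url).group(1)

print(dropped)  # ww.amazon.com
print(kept)     # www.amazon.com
```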

`sre_constants.error: bad character range \w-. at position 9`. (You need to backslash-escape the `-` or put it at the beginning or end of the character class.) – rici Apr 22 '21 at 15:30

Tim Roberts · Answer 3 · 2021-04-22T00:19:34.463

According to regexr.com, this should do what you want and is simpler: (?<=\/\/)([^/?']*) . After all, the domain is literally everything after the // up to the next / or ? or end of string.
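
A minimal sketch of how this lookbehind could be applied in Python, assuming each URL is tested as a separate string without the surrounding apostrophes. The variable names are illustrative only.

```python
import re

urls = [
    'https://www.valid.ly?param',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'HTTPS://github.com/keppel/pinn',
    'http://css-cursor.techstream.org',
]

# (?<=//) asserts that '//' precedes the match without consuming it;
# the capture group then runs up to the next '/', '?', apostrophe or end of string.
domain = re.compile(r"(?<=//)([^/?']*)")

domains = [domain.search(url).group(1) for url in urls]
print(domains)
```

Because the pattern never names the scheme, the upper-case `HTTPS://` needs no special handling here.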

I just tested this and it does not actually work! Your solution seems to include the backslashes, and in one case, http! – Willy_Golden Apr 21 '21 at 23:42
Are you testing the strings separately, or are you including the apostrophes? You can fix it for the apostrophes. – Tim Roberts Apr 21 '21 at 23:54

The fourth bird · Answer 4 · 2021-04-21T23:51:34.457

For the example data, you can use an alternation for com, org, ly and cc, and escape the dot to match it literally.

To match css-cursor.techstream.org you can use a repeated group matching either - or .

Note that [^\/\/] is the same as [^/] and matches any char except a /

\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b

\w+ Match 1+ word chars
(?:[.-]\w+)* Optionally repeat matching either . or - and 1+ word chars
\. Match a literal dot (note that it has to be escaped)
(?:ly|org|com|cc) Non capture group, match any of the alternatives
\b A word boundary to prevent a partial match
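
The breakdown above can be tried against the question's URLs with a short sketch like the following. It uses `re.search`, so the first match in each string is taken; the variable names are illustrative.

```python
import re

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

# Word chars, optionally repeated '.'/'-' separated parts, then one of the listed TLDs.
pattern = re.compile(r'\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b')

# group(0) is the whole match, since the pattern has no capture group.
domains = [pattern.search(url).group(0) for url in urls]
print(domains)
```

The scheme is skipped without being named in the pattern: `http` is followed by `:`, so no match can start there.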

If you also want to match the protocol, you can use a capture group for the string that you want.

\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b
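
Since this variant spells out the protocol, the upper-case `HTTPS://` and `Http://` entries in the example data only match when case is ignored. A minimal sketch, assuming Python's `re.IGNORECASE` flag is acceptable for the use case:

```python
import re

PATTERN = r'\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b'

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

url = 'HTTPS://github.com/keppel/pinn'

# Without the flag the lower-case 'http' in the pattern does not match 'HTTPS'.
no_match = case_sensitive.search(url)

# With the flag the domain is available in capture group 1.
domain = case_insensitive.search(url).group(1)

print(no_match)  # None
print(domain)    # github.com
```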

score 0 · Answer 5 · answered Sep 08 '21 at 17:38

I added a literal dot (.) and a dash (-) to the character class of the regular expression:

data=['https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org']

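import re

# re.IGNORECASE so that 'HTTPS://' and 'Http://' are matched as well
pattern = re.compile(r'https?://([\w.-]+)', re.IGNORECASE)

# use a separate loop variable so the list is not shadowed
for url in data:
    match = pattern.match(url)
    if match:
        print(match.group(1))

output:

www.amazon.com
www.interactivedynamicvideo.com
www.nytimes.com
evonomics.com
github.com
phys.org
iot.seeed.cc
www.bfilipek.com
beta.crowdfireapp.com
www.valid.ly
css-cursor.techstream.org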
