2

I am looking to create a regex in Python to extract ONLY the domains from the set of URLs at the bottom of this post. I have been using https://regexr.com/ to test my regex before applying Series.str.extract(). So far, I have gotten VERY close, but the first character (the first 'w' in www, where there is one) is not being captured. The regex I have so far is this:

[^\/\/](\w*.\w*.com|\w*.\w*.org|\w*.\w*.cc|\w*.\w*.ly)

How can I modify this to go from http://css-cursor.techstream.org to only css-cursor.techstream.org?

'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param',
 'http://css-cursor.techstream.org'
Willy_Golden

5 Answers

1

Is the regex a hard requirement, e.g. because you need to combine it with an existing regex? If not, there's an easy tool in the standard library that does it:

from urllib.parse import urlparse

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

domains = [urlparse(url).netloc for url in urls]
print(domains)

Well, I guess the regex is faster:

>>> import re
>>> netloc = re.compile(r'^https?://([^/?^]+)', flags=re.I)
>>> %timeit [netloc.match(url).group(1) for url in urls]
5.66 µs ± 97.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit [urlparse(url).netloc for url in urls]
23.3 µs ± 3.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Chris Wesseling
  • Hi! I was being asked to do this as part of an exercise to practice working with a regex. I appreciate you showing me urlparse! Currently working on figuring out what a non-capturing group is. – Willy_Golden Apr 22 '21 at 12:15
  • @Willy_Golden Have fun learning a powerful tool. If you are having trouble getting to grips with regexes, you could try a different point of view and practise turning a given regex into a finite state automaton. I find that approaching a subject from different angles helps form the right intuition. And formal language theory [can be very amusing](https://stackoverflow.com/a/1732454/383793) – Chris Wesseling Apr 23 '21 at 09:01
0

I've changed it a bit to the following expression:

[^\/\/]([\w\-.]*\.(?:org|com|cc|ly))

  1. The . before the TLD is now escaped using \ (\. means the literal character . and not "any character").
  2. I've added - and . to the characters allowed in the host name (not only \w).
  3. I've grouped the TLDs (org, com, cc, ly) into a non-capturing group ((?:...)), just to make the regular expression look cleaner and eliminate repetition.
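A quick way to check the expression in Python, as a minimal sketch. One assumption departs from the pattern above: the leading `[^\/\/]` is swapped for a literal `//`, because a negated class consumes the one character in front of the group, so the group would otherwise start at the second character of the host name. The name `host_re` and the shortened URL list are only illustrative.

```python
import re

# Assumption: a literal "//" anchors the match instead of the negated class
# [^\/\/], which would consume the first character of the host name.
host_re = re.compile(r'//([\w\-.]*\.(?:org|com|cc|ly))')

urls = [
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

# search() is used because the pattern is not anchored at the start of the URL.
hosts = [host_re.search(url).group(1) for url in urls]
print(hosts)
```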
Yam Mesicka
  • `sre_constants.error: bad character range \w-. at position 9`. (You need to backslash-escape the `-` or put it at the beginning or end of the character class.) – rici Apr 22 '21 at 15:30
0

According to regexr.com, this should do what you want and is simpler: (?<=\/\/)([^/?']*) . After all, the domain is literally everything after the // up to the next / or ? or end of string.
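Applied one URL at a time in Python, the lookbehind version might look like the sketch below. The names `domain_re` and `domains` are illustrative, and `re.search` is assumed rather than `re.match`, since the lookbehind can only succeed after the `//`.

```python
import re

# The lookbehind (?<=//) requires "//" immediately before the match
# without making it part of the captured text.
domain_re = re.compile(r"(?<=//)([^/?']*)")

urls = [
    'HTTPS://github.com/keppel/pinn',
    'https://www.valid.ly?param',
    'http://beta.crowdfireapp.com/?beta=agnipath',
]

domains = [domain_re.search(url).group(1) for url in urls]
print(domains)
```

Because the pattern never names the scheme, the upper-case `HTTPS://` entry needs no case-insensitive flag.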

Tim Roberts
  • I just tested this and it does not actually work! Your solution seems to include the backslashes, and in one case, http! – Willy_Golden Apr 21 '21 at 23:42
  • Are you testing the strings separately, or are you including the apostrophes? You can fix it for the apostrophes. – Tim Roberts Apr 21 '21 at 23:54
0

For the example data, you can use an alternation for com, org, ly and cc, and escape the dot to match it literally.

To match css-cursor.techstream.org, you can use a repeated group matching either - or . followed by word characters.

Note that [^\/\/] is the same as [^/]: it matches any single character except a /.
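The effect of that leading class on the pattern from the question can be seen in a short sketch (the variable names here are illustrative): the class has to consume one non-slash character, and that character ends up outside the capture group.

```python
import re

# The two classes are equivalent: repeating "/" inside a class adds nothing.
same_class = re.findall(r'[^\/\/]', 'a/b//c') == re.findall(r'[^/]', 'a/b//c')

# The pattern from the question, unchanged.
original = re.compile(r'[^\/\/](\w*.\w*.com|\w*.\w*.org|\w*.\w*.cc|\w*.\w*.ly)')

m = original.search('http://www.nytimes.com/')
whole = m.group(0)     # includes the character consumed by [^\/\/]
captured = m.group(1)  # starts one character later
print(same_class, whole, captured)
```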

\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b
  • \w+ Match 1+ word chars
  • (?:[.-]\w+)* Optionally repeat matching either . or - followed by 1+ word chars
  • \. Match a literal dot (note that it has to be escaped)
  • (?:ly|org|com|cc) Non-capturing group, match any of the alternatives
  • \b A word boundary to prevent a partial match

Regex demo
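As a rough check in Python (a sketch; `domain_re` and the shortened URL list are illustrative names, not part of the answer), the pattern can be used with `re.search`, taking the whole match as the domain:

```python
import re

domain_re = re.compile(r'\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b')

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'http://css-cursor.techstream.org',
]

# There is no capture group here, so group(0) (the whole match) is the domain.
domains = [domain_re.search(url).group(0) for url in urls]
print(domains)
```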

If you also want to match the protocol, you can use a capture group for the string that you want.

\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b

Regex demo
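Since the sample data contains `HTTPS://` and `Http://`, the protocol version probably needs a case-insensitive flag in Python. The sketch below assumes `re.IGNORECASE`, which is not part of the pattern as written; the variable names are illustrative.

```python
import re

pattern = r'\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b'
url = 'HTTPS://github.com/keppel/pinn'

# Without a flag, the lower-case "http" in the pattern does not match "HTTPS".
case_sensitive = re.search(pattern, url)

# Assumption: re.IGNORECASE is added so upper- and mixed-case schemes match.
case_insensitive = re.search(pattern, url, flags=re.IGNORECASE)
domain = case_insensitive.group(1)
print(case_sensitive, domain)
```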

The fourth bird
0

I added a literal dot . and a dash - to the character class, and made the match case-insensitive so that HTTPS:// and Http:// are matched too:

data = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

import re

# Capture word characters, dots and dashes after the scheme.
# re.IGNORECASE lets the upper- and mixed-case schemes match as well.
pattern = re.compile(r'https?://([\w.\-]+)', flags=re.IGNORECASE)

for url in data:
    match = pattern.match(url)
    if match:
        print(match.group(1))

output:

www.amazon.com
www.interactivedynamicvideo.com
www.nytimes.com
evonomics.com
github.com
phys.org
iot.seeed.cc
www.bfilipek.com
beta.crowdfireapp.com
www.valid.ly
css-cursor.techstream.org
Golden Lion