3

I need a regexp to extract just the domain name part of a URL. For example, if I had the following URL:

http://www.website-2000.com

the bit I'd want the regex to match would be 'website-2000'

If you could also explain what each part of the regex does, to help me understand it, that would be great.

Thanks

geoffs3310
  • Possible duplicate of [Domain name validation with RegEx](https://stackoverflow.com/questions/10306690/domain-name-validation-with-regex) – csilk Nov 22 '17 at 00:22

5 Answers

11

This one should work. There might be some faults with it, but none that I can think of right now. If anyone wants to improve on it, feel free to do so.

/http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*/i

http:\/\/            matches the "http://" part
(?:www\.)?           is a non-capturing group that matches zero or one "www."
([a-z0-9\-]+)        is a capturing group that matches character ranges a-z, 0-9
                     in addition to the hyphen. This is what you wanted to extract.
(?:\.[a-z\.]+[\/]?)  is a non-capturing group that matches the TLD part (i.e. ".com",
                     ".co.uk", etc) in addition to zero or one "/"
.*                   matches the rest of the url

http://rubular.com/r/ROz13NSWBQ
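As a rough illustration, the pattern above can be tried out in Python. This is only a sketch: the helper name `extract_name` is made up for the example, the `/.../i` delimiters are dropped, and `re.IGNORECASE` stands in for the trailing `i` flag.

```python
import re

# The answer's pattern without the /.../i delimiters;
# re.IGNORECASE plays the role of the trailing "i" flag.
DOMAIN_RE = re.compile(
    r"http:\/\/(?:www\.)?([a-z0-9\-]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)


def extract_name(url):
    """Return the first capture group, or None if the URL does not match.

    extract_name is a hypothetical helper name, not part of the answer.
    """
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


print(extract_name("http://www.website-2000.com"))      # website-2000
print(extract_name("http://website-2000.co.uk/about"))  # website-2000
print(extract_name("https://www.website-2000.com"))     # None
```

Two limits show up quickly: the pattern only accepts `http://`, so an `https://` URL does not match at all, and a subdomain other than `www` is captured in place of the domain (`http://blog.example.com` yields `blog`).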

hlindset
    The `.*` at the end is wrong; replace it with `[^ ]*`. As written, it also matches characters after the domain name. For example, in `http://www.website-2000.com jerry hates tom`, the trailing `jerry hates tom` is also included in the match. This is outside the scope of the question, but it helps if you want to use the regex more broadly. – Anshit Chaudhary Sep 28 '17 at 10:50
4

Let me introduce you to this wonderful tool, txt2re: a regular expression generator.

Here you can experiment with regex and generate code in many languages.

shanethehat
realbot
1
r/^[^:]+:\/\/[^/?#]+//

This worked for me.

It will match any scheme or protocol and then, after the ://, any character that's not a /, ? or #. These three characters, when they first occur in a URL, signal the end of the domain, so that's where I end the match.
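A small Python sketch of the same idea, for anyone who wants to test it. The capturing group around the host part and the helper name `host_of` are additions for the example, not part of the pattern above.

```python
import re

# Scheme, then "://", then everything up to the first "/", "?" or "#".
# The parentheses around the host are an addition so it can be read back.
AUTHORITY_RE = re.compile(r"^[^:]+://([^/?#]+)")


def host_of(url):
    """Return the host part of the URL, or None if there is no scheme."""
    match = AUTHORITY_RE.match(url)
    return match.group(1) if match else None


print(host_of("http://www.website-2000.com/path?x=1"))  # www.website-2000.com
print(host_of("ftp://example.org#top"))                 # example.org
```

Note that this returns the whole host, including `www.` and the TLD, so it would still need trimming to get just `website-2000`.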

zeffdotorg
0
http://www\.([^/]+)

No need to use a regexp at all; use the urlparse module (in Python 3 it lives at urllib.parse):

>>> from urlparse import urlparse
>>> '.'.join(urlparse("http://www.website-2000.com").netloc.split('.')[-2:])
'website-2000.com'
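The session above uses the Python 2 import and keeps the TLD. A possible Python 3 variant that returns only the name is sketched below; the helper name `registered_name` is hypothetical, and the code naively assumes a single-label TLD such as `.com`, so it would give the wrong answer for hosts like `example.co.uk`.

```python
from urllib.parse import urlparse


def registered_name(url):
    """Return the label just before the TLD, e.g. 'website-2000'.

    Hypothetical helper. Assumes a single-label TLD (".com", ".org");
    multi-part suffixes such as ".co.uk" are not handled.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


print(registered_name("http://www.website-2000.com"))   # website-2000
print(registered_name("https://website-2000.com/page")) # website-2000
```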

Kimvais
-1

With this one you don't have to worry about the http/https/ftp etc. in front, and it also captures all your subdomains.

/(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*/i

The only times it fails that I've found are:

- If a . precedes the domain/subdomain without any text before it, the . is included in the regex capture.
- Emails with . in them will not work. (Fix this by checking the passed domain for the @ symbol before running it through the regex.)
- Whitespace in the middle of the domain/subdomain.
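To see how this pattern behaves, here is a quick Python sketch. It assumes the pattern is applied as an unanchored search (`re.search`), which is what lets it skip over the scheme; the helper name `capture` is made up for the example.

```python
import re

# The pattern above without the /.../i delimiters; re.IGNORECASE replaces "i".
SUBDOMAIN_RE = re.compile(
    r"(?:www\.)?([a-z0-9\-.]+)(?:\.[a-z\.]+[\/]?).*",
    re.IGNORECASE,
)


def capture(text):
    """Return the first capture group of the first match, or None."""
    match = SUBDOMAIN_RE.search(text)  # search, not match: scheme is skipped
    return match.group(1) if match else None


print(capture("https://www.website-2000.com"))  # website-2000
print(capture("ftp://blog.example.com/file"))   # blog.example
print(capture(".example.com"))                  # .example (leading dot kept)
```

The last call reproduces the first failure case listed above: the leading `.` ends up inside the capture.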

animuson
bradbyu