1

I am trying to create a regex filter that will be used to sanitize domains that are processed by a python script.

The domains could possibly be just regular domain names

  • something.com, some.something.com

or could have a url structure

or could have url structure with www

I currently have a crude regex that pulls domains out of these structures, but I have not figured out a way to filter out the www. prefix.

(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-@]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,11}

This regex does a decent job of grabbing domains out of URLs, but when I try any kind of negative lookahead to remove the www., I can't get the desired result. I've tried (?!www.), which only removed one w rather than all three and the dot. Any help figuring this out would be most appreciated.

Aran-Fey
user2292661
  • Does it have to be regex? Why not use [`urlparse`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse)? – Aran-Fey Feb 01 '18 at 16:24
  • Is [this](https://regex101.com/r/NZsPze/1) what you're looking for? – ctwheels Feb 01 '18 at 16:28
  • That one pulls out the domains that do not have the www. in them. I want the domains with the www., but with the www. removed and just the second level domain "something.com" as the match. – user2292661 Feb 02 '18 at 16:26

4 Answers

3

Unless you absolutely have to use regex, it's better to use something designed for this - like the built-in urlparse. For one thing, your regex (and the one linked in the comments) won't match domains with non-ASCII characters.

>>> from urlparse import urlparse # Python 2
>>> # from urllib.parse import urlparse # Python 3

>>> urlparse('http://www.some.domain/the/path')
ParseResult(scheme='http', netloc='www.some.domain', path='/the/path', params='', query='', fragment='')
>>> urlparse('http://www.some.domain/the/path').netloc
'www.some.domain'

Note that you might want to detect strings without a scheme and add one, because urlparse only fills in netloc when a scheme is present:

>>> url = 'www.other.domain'
>>> urlparse(url)
ParseResult(scheme='', netloc='', path='www.other.domain', params='', query='', fragment='')
>>> if not urlparse(url).scheme:
...     print urlparse('http://' + url)
ParseResult(scheme='http', netloc='www.other.domain', path='', params='', query='', fragment='')

so you always get the domain in the netloc attribute of the ParseResult.

Once you have the domain separated out, if you want to remove the 'www.', there are any number of simple ways to do it.
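One such way can be sketched as follows. `strip_www` is a hypothetical helper name, not part of urlparse, and the sketch assumes that only a single leading 'www.' label should be dropped. It uses `startswith` and slicing rather than `str.lstrip`, because `lstrip('www.')` strips any leading run of the characters 'w' and '.', which mangles a host such as 'www.web.com'.

```python
def strip_www(host):
    """Return host without one leading 'www.' label (hypothetical helper)."""
    prefix = 'www.'
    if host.startswith(prefix):
        return host[len(prefix):]
    return host


print(strip_www('www.some.domain'))   # some.domain
print(strip_www('some.domain'))       # some.domain
print(strip_www('www.web.com'))       # web.com
print('www.web.com'.lstrip('www.'))   # eb.com  (the lstrip pitfall)
```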

Nathan Vērzemnieks
1

Use urlparse; a sample follows. I find urlparse easier to work with than regex. It returns a ParseResult object, from which we can pick the attribute we want and then apply whatever logic is needed to extract the required host.

>>> from urlparse import urlparse
>>> u1 = "https://example.com"
>>> d1 = urlparse(u1)
>>> d1.hostname
'example.com'

>>> u2 = 'https://www.123.com'
>>> h = urlparse(u2)
>>> host = h.hostname
>>> host[4:]  # drops the first four characters; only correct when the host starts with 'www.'
'123.com'

>>> u3 = 'something.com'
>>> d3 = urlparse(u3)
>>> if bool(d3.netloc):
...     print(d3)
... else:
...     print(d3.path)
... 
something.com

>>> d4 = 'somenew.net/pathis/123'
>>> u4 = urlparse(d4)
>>> u4.path
'somenew.net/pathis/123'
>>> u4.path.split('/')[0]
'somenew.net'
Ajay2588
  • Down vote? Please explain the reason. That would help – Ajay2588 Feb 01 '18 at 16:45
  • I did not downvote your answer, but in general answers with just code are not as well received as answers that explain _why_ the approach you suggest is good. Also, your answer doesn't address the question fully, since it doesn't work for urls without a scheme (eg. `http://`) at the beginning, which the asker specifically mentions. – Nathan Vērzemnieks Feb 01 '18 at 21:22
  • I appreciate your concern, and thanks for the comment. I now think I should have written some explanation for choosing urlparse over regex. I will address the leftover part of the actual question. Again, thanks a lot. – Ajay2588 Feb 02 '18 at 05:45
0

Try

((?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)(?<!\bwww\.)
 (?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)*
 (?:[a-z][a-z0-9-]*[a-z0-9]|[a-z]))

And examples

Explanation:

  • [a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\. matches a label as described by RFC 1034, followed by a dot.
  • (?<!\bwww\.) asserts that the label just matched, i.e. the part matched by (?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.) on the first line, is not www., so the match does not begin with www.
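The behaviour of the lookbehind can be checked with a short sketch. The pattern is copied from above; `find_domain` is a hypothetical helper name, and `re.IGNORECASE` is an assumption added here because the pattern only lists lowercase letters.

```python
import re

# The pattern from above, compiled with re.VERBOSE so that the line breaks
# and indentation inside it are ignored.
DOMAIN = re.compile(r"""
    ((?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)(?<!\bwww\.)
     (?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)*
     (?:[a-z][a-z0-9-]*[a-z0-9]|[a-z]))
""", re.VERBOSE | re.IGNORECASE)


def find_domain(text):
    """Return the first domain found in text, or None (hypothetical helper)."""
    match = DOMAIN.search(text)
    return match.group(1) if match else None


print(find_domain('www.something.com'))              # something.com
print(find_domain('some.something.com'))             # some.something.com
print(find_domain('http://www.something.com/path'))  # something.com
```

A match cannot start at "www.", "ww." or "w." because the lookbehind rejects each of those first labels, so the search moves on and starts at the label after the prefix.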

To match the simplest URL scheme (no auth part) as well, use this:

https?://
(?:www\.)?
((?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)+
 (?:[a-z][a-z0-9-]*[a-z0-9]|[a-z]))

Note that re.VERBOSE is used so that whitespace in the pattern is ignored, which makes the pattern more readable.
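A minimal sketch of compiling the URL pattern this way is shown below. The inline `#` comments are allowed by verbose mode; `domain_from_url` is a hypothetical helper name, and `re.IGNORECASE` is an assumption added here rather than part of the pattern above.

```python
import re

URL_DOMAIN = re.compile(r"""
    https?://                                   # scheme
    (?:www\.)?                                  # optional www. prefix, not captured
    ((?:[a-z][a-z0-9-]*[a-z0-9]\.|[a-z]\.)+     # one or more dot-terminated labels
     (?:[a-z][a-z0-9-]*[a-z0-9]|[a-z]))         # final label
""", re.VERBOSE | re.IGNORECASE)


def domain_from_url(text):
    """Return the captured domain, or None if no http(s) URL is found."""
    match = URL_DOMAIN.search(text)
    return match.group(1) if match else None


print(domain_from_url('https://www.something.com/path'))  # something.com
print(domain_from_url('http://some.something.com'))       # some.something.com
print(domain_from_url('www.something.com'))               # None (no scheme)
```

Because this pattern requires the scheme, bare inputs such as www.something.com are not matched by it.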

Community
Aaron
  • Note that this doesn't correctly handle domains that start with numbers, or that contain underscores or non-ascii characters, all of which are allowed. This is why it's not a good idea to write your own one-off parsing: there are so many subtleties. – Nathan Vērzemnieks Feb 01 '18 at 22:14
  • It may not be a good idea to use regex to match URLs. See https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url – Aaron Feb 02 '18 at 08:41
0

Try the following:

import re
from urllib.parse import urlparse


def parse_url(url):
    # urlparse only fills in netloc when a scheme is present, so add one if missing
    if not urlparse(url).scheme:
        url = 'http://' + url
    domain = urlparse(url).netloc
    # strip a leading "www." (the dot is escaped so that it matches literally)
    return re.sub(r"^www\.", "", domain)

url = 'https://www.facebuk.com'
print(parse_url(url))
# prints: facebuk.com

url = 'www.facebuk.com'
print(parse_url(url))
# prints: facebuk.com

Monu