
I have the following regex:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

When I apply this to a text string with, let's say, "this is www.website1.com and this is website2.com", I get:

['www.website1.com']

['website2.com']

How can I modify the regex to exclude the 'www.', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...

Abhijeetk431
DDS
  • Possible duplicate of [Extract all domains from text](https://stackoverflow.com/questions/21211572/extract-all-domains-from-text) – tripleee Nov 06 '18 at 07:32

2 Answers


Try this one (thanks @SunDeep for the update):

\s(?:www\.)?(\w+\.com)

Explanation

\s matches any whitespace character

(?:www\.)? non-capturing group, matches a literal www. zero or one time

(\w+\.com) capturing group, matches one or more word characters followed by a literal .com

And in action:

import re

s = 'this is www.website1.com and this is website2.com'

# Escape the dots so they match a literal "." rather than any character
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)

Output:

['website1.com', 'website2.com']

A couple of notes about this. First, matching every valid domain name is very difficult, so although I used \w+ to capture the name in this example, I could have chosen a stricter pattern such as: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.

This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain?
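As a rough sketch of how a stricter label pattern could be combined with the optional www. prefix: the names DOMAIN_RE and extract_domains are made up for illustration, and the 2-6 letter TLD limit is borrowed from the question's regex, not from any standard.

```python
import re

# Hypothetical pattern: an optional "www." prefix (not captured), followed by
# one or more dot-terminated labels and a 2-6 letter TLD (captured).
DOMAIN_RE = re.compile(
    r'\b(?:www\.)?'
    r'((?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6})'
)

def extract_domains(text):
    """Return every domain found in text, without a leading 'www.'."""
    return DOMAIN_RE.findall(text)

print(extract_domains('this is www.website1.com and this is website2.com'))
# ['website1.com', 'website2.com']
```

Because the prefix is optional, a bare www.com is still returned whole: the engine backtracks and treats www as an ordinary label.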

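Second, this only matches .com domains. To match whichever top-level domains you are looking for, you could adjust the regular expression to something like:

\s(?:www\.)?(\w+\.(?:com|org|net))

The alternation is non-capturing so that re.findall returns plain strings rather than tuples.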
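One limitation of a leading \s is that a domain at the very start of the text is missed. A minimal sketch of one way around that, assuming a short hand-picked TLD list (TLDS, pattern and find_domains are illustrative names, not part of the original answer):

```python
import re

# Assumed list of top-level domains; extend as needed.
TLDS = ('com', 'org', 'net')

# (?:^|\s) accepts a domain at the start of the text or after whitespace.
# The trailing \b stops ".com" from matching inside ".community".
pattern = re.compile(
    r'(?:^|\s)(?:www\.)?(\w+\.(?:' + '|'.join(TLDS) + r'))\b'
)

def find_domains(text):
    """Return the captured domains, without any 'www.' prefix."""
    return pattern.findall(text)

print(find_domains('www.website1.com and website2.org or www.website3.net'))
# ['website1.com', 'website2.org', 'website3.net']
```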

user3483203

Here's an attempt:

import re
s = "www.website1.com"
# Optional literal "www." prefix in group 1, the rest of the string in group 2
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)

Output:

website1.com

If s = "website1.com", the output is the same:

website1.com
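The same prefix-stripping idea can be applied to every match of the question's original regex, which leaves that regex untouched. This is only a sketch, and domains_without_www is a hypothetical helper name:

```python
import re

# The regex from the question, with {,61} spelled out as the equivalent {0,61}.
DOMAIN = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

def domains_without_www(text):
    """Find domains, then drop a leading 'www.' from each match.

    The lookahead only strips the prefix when another dot follows,
    so a bare 'www.com' is left intact.
    """
    return [re.sub(r'^www\.(?=.+\.)', '', m) for m in re.findall(DOMAIN, text)]

print(domains_without_www('this is www.website1.com and this is website2.com'))
# ['website1.com', 'website2.com']
```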
Vikas Periyadath