
I have a list of domain names like this:

usatoday.com
detroitnews.com
virust.com
ajkdfabbbbbbb.net
ha.box.sk
www.test.net
rp.fff.com

I am trying to write a regex that matches all of these domains.

Here is my regex, but findall() returns tuples of groups instead of the full domain names:

import re
s='dd.ddd.com rp.ff.com usatoday.net'
d= re.compile(r'(?<!\S)(([a-zA-Z]{1})|([a-zA-Z]{1}[a-zA-Z]{1})|([a-zA-Z]{1}[0-9]{1})|([0-9]{1}[a-zA-Z]{1})|([a-zA-Z0-9][a-zA-Z0-9-_]{1,61}[a-zA-Z0-9]))\.([a-zA-Z]{2,6}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})(?!\S)')

result = d.findall(s)
print(result)

Output:

[('dd', '', 'dd', '', '', '', 'ddd.com'), ('rp', '', 'rp', '', '', '', 'ff.com'), ('usatoday', '', '', '', '', 'usatoday', 'net')]

I need the output to be:

['dd.ddd.com', 'rp.ff.com', 'usatoday.net']

I am new to regex, so any changes to the regex above would help.

This is an updated version of my script.

Coder123

1 Answer


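Your pattern contains capturing groups, so findall() returns a tuple of groups for each match. This uses finditer() and calls group() on each match, which returns the full matched text: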

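import re

# One or more dot-terminated labels, followed by a final label
regex = r"([\w_-]+\.)+[\w_-]+"

# Note the trailing space on each continued line: adjacent string
# literals are concatenated with no separator between them
test_str = "usatoday.com detroitnews.com virust.com "\
           "ajkdfabbbbbbb.net ha.box.sk www.test.net rp.fff.com "\
           "dd.ddd.com rp.ff.com"

# re.MULTILINE is not needed because the pattern has no ^ or $ anchors
matches = re.finditer(regex, test_str)

# group() with no argument returns the whole match, not the sub-groups
grouped = [match.group() for match in matches]

print(grouped)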

Output:

['usatoday.com', 'detroitnews.com', 'virust.com',
 'ajkdfabbbbbbb.net', 'ha.box.sk', 'www.test.net',
 'rp.fff.com', 'dd.ddd.com', 'rp.ff.com']
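
If you would rather keep findall(), one option is to make the group non-capturing with (?:...), since findall() only returns tuples when the pattern has capturing groups. The sketch below is an illustration rather than a complete domain validator: the names DOMAIN_RE and extract_domains are made up for this example, the whitespace lookarounds are borrowed from the regex in the question, and the character classes are a simplifying assumption (letters, digits and hyphens in each label, at least two letters in the last one).

```python
import re

# Non-capturing groups (?:...) make findall() return whole matches.
# (?<!\S) and (?!\S), taken from the question's regex, require whitespace
# or a string boundary on both sides of each match.
# Assumption: labels contain only letters, digits and hyphens, and the
# final label is at least two letters long.
DOMAIN_RE = re.compile(r"(?<!\S)(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?!\S)")


def extract_domains(text):
    """Return every whitespace-delimited token that looks like a domain."""
    return DOMAIN_RE.findall(text)


print(extract_domains("dd.ddd.com rp.ff.com usatoday.net"))
# ['dd.ddd.com', 'rp.ff.com', 'usatoday.net']
```

Unlike the \w-based pattern above, this version rejects underscores and tokens that are embedded in a longer string, so pick whichever matches how strict you need to be.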
xvan