Regex for filenames and domain names-Python

Question

I have a list of domain names like this:

usatoday.com
detroitnews.com
virust.com
ajkdfabbbbbbb.net
ha.box.sk
www.test.net
rp.fff.com

I am trying to write a regex to be able to match all of the said domains.

For the domains, here is my regex but it doesn't work that well:

import re
s='dd.ddd.com rp.ff.com usatoday.net'
d= re.compile(r'(?<!\S)(([a-zA-Z]{1})|([a-zA-Z]{1}[a-zA-Z]{1})|([a-zA-Z]{1}[0-9]{1})|([0-9]{1}[a-zA-Z]{1})|([a-zA-Z0-9][a-zA-Z0-9-_]{1,61}[a-zA-Z0-9]))\.([a-zA-Z]{2,6}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})(?!\S)')

result = d.findall(s)
print(result)

Output:

[('dd', '', 'dd', '', '', '', 'ddd.com'), ('rp', '', 'rp', '', '', '', 'ff.com'), ('usatoday', '', '', '', '', 'usatoday', 'net')]

I need the output to be:

['dd.ddd.com', 'rp.ff.com', 'usatoday.net']

I am new to regex so any changes in the regexes above would help.

This is an updated version of my script.

Why do you escape the `\C` ? `reg1=re.compile(r'\bC\:\S+\b')` For the second pattern you use anchors, compile with `re.MULTILINE` Like `reg2=re.compile(r'^(([a-zA-Z]{1})|([a-zA-Z]{1}[a-zA-Z]{1})|([a-zA-Z]{1}[0-9]{1})|([0-9]{1}[a-zA-Z]{1})|([a-zA-Z0-9][a-zA-Z0-9-_]{1,61}[a-zA-Z0-9]))\.([a-zA-Z]{2,6}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})$', re.MULTILINE)` — The fourth bird, Jun 10 '20 at 19:10
See edits. The domain regex is giving me no matches — Coder123, Jun 10 '20 at 19:41
It matches but doesn't output the full match and each instance. — Coder123, Jun 10 '20 at 19:43
Did you test the pattern for that string? https://regex101.com/r/q8O17O/1 — The fourth bird, Jun 10 '20 at 19:44
Then you could omit the anchors and use whitespace boundaries https://regex101.com/r/Mw6KOb/1 — The fourth bird, Jun 10 '20 at 19:49
Check output now above. It splits the results. I want it to output each domain name on its own. — Coder123, Jun 10 '20 at 19:54
Yes, that is due to all the capturing groups and re.findall, this page explains why that is https://stackoverflow.com/questions/31915018/re-findall-behaves-weird You could turn them into non capturing groups `(?:` — The fourth bird, Jun 10 '20 at 19:56
So would I just add (?: at the beginning or end of the regex? — Coder123, Jun 10 '20 at 19:57
I did not check the pattern logic, but you can remove `{1}` from the pattern https://ideone.com/RKm5rv — The fourth bird, Jun 10 '20 at 20:07
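
To illustrate the point made in the comments, here is a minimal sketch of how `re.findall` behaves with capturing versus non-capturing groups. The short `\w+` pattern is a simplified stand-in chosen for illustration. The longer pattern is the one from the question with every group changed to `(?:...)` and the redundant `{1}` removed; its label rules are otherwise left as posted and have not been checked for correctness.

```python
import re

s = 'dd.ddd.com rp.ff.com usatoday.net'

# With a capturing group, findall returns the group's last capture,
# not the text of the whole match.
capturing = re.findall(r'(\w+\.)+\w+', s)

# With a non-capturing group, findall returns each full match.
non_capturing = re.findall(r'(?:\w+\.)+\w+', s)

# The pattern from the question, with all groups made non-capturing.
domain = re.compile(
    r'(?<!\S)'
    r'(?:[a-zA-Z]|[a-zA-Z]{2}|[a-zA-Z][0-9]|[0-9][a-zA-Z]'
    r'|[a-zA-Z0-9][a-zA-Z0-9_-]{1,61}[a-zA-Z0-9])'
    r'\.(?:[a-zA-Z]{2,6}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})'
    r'(?!\S)'
)
full = domain.findall(s)

print(capturing)      # ['ddd.', 'ff.', 'usatoday.']
print(non_capturing)  # ['dd.ddd.com', 'rp.ff.com', 'usatoday.net']
print(full)           # ['dd.ddd.com', 'rp.ff.com', 'usatoday.net']
```

So `(?:` is not added once at the start or end of the regex; each opening `(` of a group becomes `(?:`.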

xvan · Accepted Answer · 2020-06-10T20:23:29.130

This uses finditer() and calls group() on each match, which returns the full matched text regardless of any capturing groups:

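import re

# One or more dot-terminated labels followed by a final label
regex = r"([\w_-]+\.)+[\w_-]+"

# Each piece ends with a space so adjacent literals do not fuse two domains together
test_str = "usatoday.com detroitnews.com virust.com "\
           "ajkdfabbbbbbb.net ha.box.sk www.test.net rp.fff.com "\
           "dd.ddd.com rp.ff.com"

matches = re.finditer(regex, test_str)

# group() with no argument returns the whole match, not the capturing group
grouped = [match.group() for match in matches]

print(grouped)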

Output:

['usatoday.com', 'detroitnews.com', 'virust.com',
 'ajkdfabbbbbbb.net', 'ha.box.sk', 'www.test.net',
 'rp.fff.com', 'dd.ddd.com', 'rp.ff.com']
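
One caveat with a pattern this permissive is that it will also match a domain-like substring inside a larger token. As a rough sketch, the whitespace boundaries suggested in the comments, `(?<!\S)` and `(?!\S)`, can be added so that only whole whitespace-delimited tokens match. The sample `text` below is made up for illustration, and the group is written as non-capturing so that `re.findall` returns full matches.

```python
import re

# Hypothetical input containing a token that is not a bare domain
text = "see usatoday.com or mail admin@ha.box.sk today"

loose = r"(?:[\w-]+\.)+[\w-]+"
# Same pattern, but the match must start and end at whitespace or string edges
bounded = r"(?<!\S)(?:[\w-]+\.)+[\w-]+(?!\S)"

loose_hits = re.findall(loose, text)
bounded_hits = re.findall(bounded, text)

print(loose_hits)    # ['usatoday.com', 'ha.box.sk']
print(bounded_hits)  # ['usatoday.com']
```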
