How to match only the domain part of a URL with regex?

Question

I'm writing Python code to process a block of text that, among content I don't need, contains URLs. From the text block I only need the domains, not the full URLs. Example input:

47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php
47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html

So here I need only gooolgeremf.top and voperforseanx.top to match, but the regex I've written also matches search.php and chrome_update.html.

My thinking is that the regex should stop matching after a /. However, I don't know how to implement that, and especially how to avoid blocking matches of domains that appear after the first / elsewhere in the text file.

The way it works so far in my code:

regexdm = r"[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}"
dmsc = re.findall(regexdm, iocsd.read())
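
To see the over-matching concretely, here is a minimal sketch that runs the pattern above against the two sample lines. The variable `text` is an assumption made for illustration; it stands in for whatever `iocsd.read()` returns.

```python
import re

# Sample input from the question; stands in for iocsd.read() (assumption).
text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html\n"
)

regexdm = r"[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}"
dmsc = re.findall(regexdm, text)

# The wanted domains are found, but so are file names from the URL paths.
print(dmsc)
```

Because `_` is not in the character class, the second path contributes `update.html` rather than the full `chrome_update.html`. Either way, the pattern has no notion of where a host name ends.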

Do you mean the FQDN/hostname, or the domain name only? – mootmoot Mar 27 '17 at 13:34

score 2 · Accepted Answer · answered Mar 27 '17 at 13:24

I'd suggest adding delimiter conditions. Assuming a domain name can only be preceded by a space, the start of a line, or two forward slashes, and followed by a space, a single slash, or the end of a line, the regex would be:

(?: |//|^)([A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,})(?: |/|$)

Demo: https://regex101.com/r/TQKlDP/1
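
As a rough sketch of how this pattern might be used from Python: `re.findall` returns only the captured group when a pattern has exactly one, and `re.MULTILINE` is needed so that `^` and `$` apply to each line rather than to the whole text. The `text` variable is an assumption standing in for the file contents.

```python
import re

# Sample input from the question (assumed to be the file contents).
text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html\n"
)

pattern = re.compile(
    r"(?: |//|^)"                                                         # leading delimiter
    r"([A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,})"  # domain (captured)
    r"(?: |/|$)",                                                         # trailing delimiter
    re.MULTILINE,
)

# With a single capturing group, findall returns just the domain strings.
domains = pattern.findall(text)
print(domains)  # ['gooolgeremf.top', 'voperforseanx.top']
```

The file names are skipped because they are preceded by a single `/` or by `_`, neither of which is an accepted leading delimiter.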

answered by Dmitry Egorov

RaminNietzsche: that's because the original regex disallows such domain names (i.e. ones with fewer than three letters in the TLD). I admit that the domain name filter looks too restrictive and even erroneous, but I prepared the answer on the assumption that such restrictions are intentional (or at least acceptable) and that the only issue was distinguishing such domain names from other parts of the text. – Dmitry Egorov Mar 27 '17 at 13:47

score 1 · Answer 2 · edited May 23 '17 at 12:32

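Regex is not the easiest way to do this; you should use urlparse (urlparse.urlparse in Python 2, urllib.parse.urlparse in Python 3):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
parsed_uri = urlparse('http://voperforseanx.top/site/chrome_update.html')
print(parsed_uri.netloc)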

Gives

voperforseanx.top

But, for reference, here is how to handle URLs with regex: Getting parts of a URL (Regex)

edited by Community · answered Mar 27 '17 at 13:21 by Arount

netloc will not contain the domain if the 'http://' prefix is missing, as in the question's input; it comes back as an empty string. – nivhanin Mar 27 '17 at 13:48
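
One possible workaround for scheme-less input, sketched under the assumption that each candidate token has already been isolated from the surrounding text: prefix it with `//` so that urlparse treats the leading part as the network location. The helper name `host_of` is hypothetical.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def host_of(token):
    """Return the network location of a URL-like token (hypothetical helper)."""
    # Without a scheme or a leading '//', urlparse treats everything as a path.
    if "://" not in token and not token.startswith("//"):
        token = "//" + token
    return urlparse(token).netloc


print(host_of("voperforseanx.top/site/chrome_update.html"))         # voperforseanx.top
print(host_of("http://voperforseanx.top/site/chrome_update.html"))  # voperforseanx.top
```

Note that netloc also includes a port or user info when the token carries them, so further trimming may be needed for such inputs.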

score 0 · Answer 3 · edited Mar 29 '17 at 11:28

In Python 2.7.13, here is an alternative approach (it depends on the input following a fixed pattern):

line = "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html"
# Take the 7th whitespace-separated token and keep everything before the first '/'.
parsed_uri = line.split()[6].split('/')[0]
print(parsed_uri)
# prints: voperforseanx.top
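
Extending the same idea to every line of the input might look like the sketch below. It assumes, as the answer does, that the URL is always the seventh whitespace-separated token; lines with fewer tokens are skipped. The `text` variable is a stand-in for the file contents.

```python
# Sample input from the question (assumed to be the file contents).
text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html\n"
)

domains = []
for line in text.splitlines():
    tokens = line.split()
    if len(tokens) > 6:  # skip lines that do not follow the expected layout
        # Keep only the part of the 7th token before the first '/'.
        domains.append(tokens[6].split("/")[0])

print(domains)  # ['gooolgeremf.top', 'voperforseanx.top']
```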

edited by Peter Mortensen · answered Mar 27 '17 at 13:37 by nivhanin

score 0 · Answer 4 · answered Mar 27 '17 at 13:45

(\b[\w\.]+\.[a-zA-Z]{2,}\b)(.+)$

In this regex, the

(\b[\w\.]+\.[a-zA-Z]{2,}\b)

part matches what you are looking for; the rest of the line is consumed and discarded. To work, this regex needs a gmi modificator (the global, multiline, and case-insensitive flags).
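
Python's re module has no g flag, so a rough equivalent is to iterate over all matches with `re.finditer` and pass `re.MULTILINE | re.IGNORECASE` for the m and i parts. This is a sketch under that assumption, with `text` standing in for the file contents.

```python
import re

# Sample input from the question (assumed to be the file contents).
text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html\n"
)

pattern = re.compile(
    r"(\b[\w\.]+\.[a-zA-Z]{2,}\b)(.+)$",
    re.MULTILINE | re.IGNORECASE,  # 'm' and 'i'; iterating all matches plays the role of 'g'
)

# Group 1 is the domain; group 2 swallows the rest of the line and is ignored.
domains = [m.group(1) for m in pattern.finditer(text)]
print(domains)  # ['gooolgeremf.top', 'voperforseanx.top']
```

Since `(.+)` requires at least one character after the domain, a domain sitting at the very end of a line would not be matched by this pattern.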

answered by quAnton

What is a "gmi modificator"? Do you have a reference? – Peter Mortensen Mar 29 '17 at 11:29
http://www.ciaomondo.it/regular-expressions/english-guide.php#flags In this guide there is a simple explanation – quAnton Mar 30 '17 at 09:10
