1

I'm writing Python code to process a block of text that contains URLs among other text I don't need. From this text block I need only the domains, not the full URLs. Example input:

47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php
47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html

So here I need only gooolgeremf.top and voperforseanx.top matched but the regex I've written will also match search.php and chrome_update.html.

My idea is that the regex should stop matching after a /. However, I don't know how to implement that, and especially how to avoid blocking matches for domains that appear after the first / in the whole text file.

The way it works so far in my code:

import re
regexdm = r"[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}"
dmsc = re.findall(regexdm, iocsd.read())  # iocsd is an already-open file-like object
skooog

4 Answers

2

I'd suggest adding delimiter conditions. Assuming a domain name can only be preceded by a space, two forward slashes, or the start of a line, and followed by a space, a single slash, or the end of a line, the regex would be:

(?: |//|^)([A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,})(?: |/|$)

Demo: https://regex101.com/r/TQKlDP/1
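
In Python, the pattern could be applied roughly as follows. This is a minimal sketch, not a definitive implementation: the sample text is copied from the question, the names DOMAIN_RE and find_domains are made up for illustration, and re.MULTILINE is assumed so that ^ and $ match at every line boundary rather than only at the ends of the whole text.

```python
import re

# Delimited version of the question's pattern; only the middle group is captured.
DOMAIN_RE = re.compile(
    r"(?: |//|^)"
    r"([A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,})"
    r"(?: |/|$)",
    re.MULTILINE,  # let ^ and $ match at each line boundary
)


def find_domains(text):
    """Return every captured domain, in order of appearance."""
    # With exactly one capture group, findall returns a list of strings.
    return DOMAIN_RE.findall(text)


text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html"
)

print(find_domains(text))  # ['gooolgeremf.top', 'voperforseanx.top']
```

Because the delimiters are consumed as part of each match, two domains separated by a single space would probably yield only the first one; replacing the trailing group with a lookahead such as (?= |/|$) is one way to avoid that.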

Dmitry Egorov
    RaminNietzsche: that's because the original regex disallows such domain names (i.e. ones with fewer than three letters in the TLD). I admit that the domain name filter looks too restrictive and even erroneous, but I wrote the answer on the assumption that such restrictions are intentional (or at least acceptable) and that the only issue was distinguishing such domain names from other parts of the text. – Dmitry Egorov Mar 27 '17 at 13:47
1

Regex is not the easiest way to do this; you should use urlparse instead:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
parsed_uri = urlparse('http://voperforseanx.top/site/chrome_update.html')
print(parsed_uri.netloc)

Gives

voperforseanx.top

But, for reference, here is how to handle URLs with regex: Getting parts of a URL (Regex)
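
The lines in the question carry no scheme, and urlparse only fills in netloc when the host part is introduced by //. A possible workaround, sketched here for Python 3, is to prepend // to scheme-less tokens before parsing. The helper name domain_of is hypothetical, and the sketch assumes the URL-like token has already been isolated from the rest of the line.

```python
from urllib.parse import urlparse


def domain_of(token):
    """Return the host part of a URL-like token, with or without a scheme."""
    # Assumption: a token without '//' is a bare host[/path] string.
    if "//" not in token:
        token = "//" + token  # makes urlparse treat the leading part as netloc
    return urlparse(token).netloc


print(domain_of("voperforseanx.top/site/chrome_update.html"))
print(domain_of("http://voperforseanx.top/site/chrome_update.html"))
```

Both calls should print voperforseanx.top; without the added //, the whole token ends up in the path component and netloc stays empty.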

Community
Arount
  • netloc will not contain the domain if you remove the 'http://', as in the question's input; it comes back as an empty string. – nivhanin Mar 27 '17 at 13:48
0

In Python 2.7.13, here is an alternative approach (it depends on the input pattern):

line = "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html"
# Field 6 (zero-based) holds the URL; keep only the part before the first '/'.
parsed_uri = line.split()[6].split('/')[0]
print(parsed_uri)
>> voperforseanx.top
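
The same positional idea can be extended to a whole block of text. The following is only a sketch under the assumption that every relevant line has the layout shown in the question, with the URL as the seventh whitespace-separated field; the function name domains_by_position and its index parameter are made up for illustration.

```python
def domains_by_position(text, index=6):
    """Collect the part before the first '/' of the field at `index` on each line."""
    domains = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= index:
            continue  # line is too short to follow the assumed layout
        domains.append(fields[index].split("/")[0])
    return domains


text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html"
)

print(domains_by_position(text))  # ['gooolgeremf.top', 'voperforseanx.top']
```

This avoids regular expressions entirely, but it returns whatever sits in that field, so it is only as reliable as the line layout.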
Peter Mortensen
nivhanin
0
(\b[\w\.]+\.[a-zA-Z]{2,}\b)(.+)$

In this regex, the

(\b[\w\.]+\.[a-zA-Z]{2,}\b)

part matches what you are looking for (capture group 1); the rest of the line is discarded. For this to work, the regex needs the g, m, and i modifiers (global, multiline, case-insensitive).
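
Python has no g modifier; iterating with re.finditer (or calling re.findall) plays that role, while m and i correspond to re.MULTILINE and re.IGNORECASE. A minimal sketch of how this pattern might be used, with the sample text taken from the question and the names LINE_RE and first_domains chosen only for illustration:

```python
import re

# Group 1 is the domain; group 2 swallows the remainder of the line.
LINE_RE = re.compile(
    r"(\b[\w\.]+\.[a-zA-Z]{2,}\b)(.+)$",
    re.MULTILINE | re.IGNORECASE,
)


def first_domains(text):
    """Return group 1 of every match, i.e. at most one domain per line."""
    return [match.group(1) for match in LINE_RE.finditer(text)]


text = (
    "47.91.158.176 or 54.145.185.110 port 80 - gooolgeremf.top - GET /search.php\n"
    "47.90.205.113 or 35.187.59.173 port 80 - voperforseanx.top/site/chrome_update.html"
)

print(first_domains(text))  # ['gooolgeremf.top', 'voperforseanx.top']
```

Note that (.+) requires at least one character after the domain, so a domain sitting at the very end of a line would not be matched by this pattern as written.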

quAnton