The end goal I have is to input a block of text (multiple lines) which contains domains and output just a list of domains.

Example input:

2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top - 

The output I want in this case:

www.hlowdolax.top
www.hjaoopoa.top
www.foolalexas.top
pentsshoperqunity.top

Eventually I found out that the best tool for this purpose is re.findall() and tried to do it this way:

matchedDomains=re.findall(myRegex, fileWithMessyText.read())
print matchedDomains

And in the output I see that it matched all the domains but the result looks like this:

[('www', 'hlowdolax', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'hjaoopoa', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'foolalexas', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('pentsshoperqunity', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('nikesportweardewvv', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('www', 'dpooldoopl', 'a', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('www', 'sosgenerga', 'lz', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('search', 'p', 'h', 'p')]

If that's relevant, here is the regex I use:

([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})

I googled a variety of keywords, tested my regex on pythex.org, and learned the term "match captures" and that it has something to do with "capture groups". However, all the advice I found here about using group does not appear to be compatible with findall, and when I try search or match instead, it only works for the first line and prints the whole line rather than just the match. (This may sound like rambling, but I didn't document my wanderings, so I don't remember exactly what I tried.) Intuitively, it also seems like a workaround to loop and match line by line when there is a tool that matches the whole block. The problem is that I don't know how to use it.

I'm not looking for someone to write the code for me but I'm really lost at this point. Is there a way to use findall and output just nicely formatted matches?

skooog

6 Answers

2

The parentheses in your regex create capturing groups; just remove them:

[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}

Here is a demonstration.

>>> re.findall(r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}', s)
['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 
 'pentsshoperqunity.top']
daphtdazz
Iron Fist
  • Thank you, this really makes sense now! As I understood, removing the parentheses is the same as the solution suggested by @daphtdazz i.e. adding ?: at the beginning of each capture group? – skooog Mar 23 '17 at 14:18
  • @aistesk, in your case, yes. But if you only want the complete domain and are not interested in capturing the groups separately, then, IMHO, adding ?: is unnecessary and just makes the regex look more complicated. – Iron Fist Mar 23 '17 at 14:23
0

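You don't need a regex for this; use split() instead:

>>> data = '''2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if'''
>>> print(data.split()[-4])
www.hlowdolax.top

Explanation:

Called without arguments, split() splits on any run of whitespace, so the extra spaces need no special handling. The domain is then the fourth field from the end (index -4). Note that this assumes the line ends with a request such as GET /usp?f=1if; for a line that stops after the trailing "-", count from the front instead (index 5).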

akash karothiya
0

Just don't capture the groups:

myRegex = r'(?:[A-Za-z0-9]{1,})\.(?:[A-Za-z0-9]{1,10})\.?(?:[A-Za-z]{1,})\.?(?:[A-Za-z]{1,})'

The ?: at the beginning of the group says "don't capture me".

As per the docs, if the pattern has no capturing groups, findall returns a list of the strings that matched the pattern.
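
To make that rule concrete, here is a small sketch of how the return value of findall changes with the number of capturing groups. The sample text and the shortened patterns are made up for illustration and are not the regex from the question.

```python
import re

# Hypothetical sample text, not the data from the question.
text = "a - www.example.top - b\nc - shop.example.top - d"

# No capturing groups: a list of the whole matches.
no_groups = re.findall(r'[A-Za-z0-9]+\.[A-Za-z0-9.]+', text)

# One capturing group: a list containing only that group's text.
one_group = re.findall(r'([A-Za-z0-9]+)\.[A-Za-z0-9.]+', text)

# Two or more capturing groups: a list of tuples, one item per group.
two_groups = re.findall(r'([A-Za-z0-9]+)\.([A-Za-z0-9.]+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```

The tuples in the question's output are the third case: four capturing groups, so each match becomes a 4-tuple.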

daphtdazz
0

A solution using the str.split() and re.split() functions:

import re

s = '''
2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -
'''

result = [re.split(r'\s+', l)[5] for l in s.strip().split('\n')]

print(result)

The output:

['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 'pentsshoperqunity.top']
RomanPerekhrest
0

If you still want to use that regex, you need to retrieve each entire match. This can be done with the compiled pattern's search() method: it returns a match object for the first match, and that object's group(0) is the entire match. The full code below is based on your regex.

import re

number = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

whole = re.compile(r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})")

m = whole.search(number)
output = []
while m:
    output.append(m.group(0))  # group(0) is the entire match
    # Resume just past this match; m.end() stays correct even if a domain repeats.
    m = whole.search(number, m.end())

print(output)
# ['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 'pentsshoperqunity.top']
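
As an alternative sketch, not part of the answer above: re.finditer yields a match object for every non-overlapping match, so group(0) can be collected without managing the search position by hand. The names log, pattern and domains are illustrative; the regex and the sample lines are the ones from the question.

```python
import re

log = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

# The question's regex, capturing groups left in place.
pattern = re.compile(
    r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})"
)

# group(0) is the entire match, regardless of how many groups the pattern has.
domains = [m.group(0) for m in pattern.finditer(log)]
print(domains)
```

This keeps the capturing groups available (m.group(1) and so on) in case the individual parts are needed later.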
Sangbok Lee
0

In your case, every domain is wrapped in '-', so try this:

number = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

# The captured text keeps the spaces around each domain, so strip them.
[d.strip() for d in re.findall(r'.*-(.*)-.*', number)]
Shenglin Chen