How to print only the matched regex strings from a block of text?

Question

The end goal I have is to input a block of text (multiple lines) which contains domains and output just a list of domains.

Example input:

2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -

The output I want in this case:

www.hlowdolax.top
www.hjaoopoa.top
www.foolalexas.top
pentsshoperqunity.top

Eventually I found out that the best tool for this purpose is re.findall() and tried to do it this way:

matchedDomains=re.findall(myRegex, fileWithMessyText.read())
print matchedDomains

And in the output I see that it matched all the domains but the result looks like this:

[('www', 'hlowdolax', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'hjaoopoa', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'foolalexas', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('pentsshoperqunity', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('nikesportweardewvv', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('www', 'dpooldoopl', 'a', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('www', 'sosgenerga', 'lz', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('search', 'p', 'h', 'p')]

If that's relevant, here is the regex I use:

([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})

I googled a variety of keywords, tested my regex with pythex.org, and learned about "match captures" and that they have something to do with "capture groups". All the advice I found here about using group appears to be incompatible with findall, and when I try search or match instead, it only works for the first line and prints the whole line rather than just the match. (This may sound like rambling, but I didn't document my wanderings, so I don't remember exactly what I tried.) Intuitively, looping and matching line by line also seems like a workaround when there is a tool that matches the whole block. The problem is that I don't know how to use it.

I'm not looking for someone to write the code for me but I'm really lost at this point. Is there a way to use findall and output just nicely formatted matches?

If all entries in the file have the format shown in the example, why not simply read the file into a list and print only the host from each line? Just split each line and print the element at index 5 of the resulting list. – darvark, Mar 23 '17 at 13:48

score 2 · Accepted Answer · edited Mar 23 '17 at 14:18

The parentheses in your regex create capturing groups, and when a pattern contains groups, findall returns the groups instead of the whole match. Just remove them:

[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}

Here is a demonstration.

>>> re.findall(r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}', s)
['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 
 'pentsshoperqunity.top']
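
To get the one-domain-per-line output asked for in the question, the list returned by findall can be joined with newlines. This is only a minimal sketch: it assumes the log text is already in a string `s` (in practice it would come from something like `fileWithMessyText.read()`), and the name `DOMAIN_RE` is made up for the illustration.

```python
import re

# Sample text from the question; a real script would read it from a file.
s = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

# The question's pattern with the capturing parentheses removed.
DOMAIN_RE = re.compile(
    r'[A-Za-z0-9]{1,}\.[A-Za-z0-9]{1,10}\.?[A-Za-z]{1,}\.?[A-Za-z]{1,}'
)

# With no groups in the pattern, findall returns the full matches as strings.
matched_domains = DOMAIN_RE.findall(s)

# One domain per line instead of the list repr.
print("\n".join(matched_domains))
```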

edited Mar 23 '17 at 14:18

daphtdazz

answered Mar 23 '17 at 13:52

Iron Fist

Thank you, this really makes sense now! As I understood, removing the parentheses is the same as the solution suggested by @daphtdazz i.e. adding ?: at the beginning of each capture group? – skooog Mar 23 '17 at 14:18
@aistesk, in your case, yes. But if you are interested only in the complete domain and not in capturing the groups separately, then IMHO the ?: is unnecessary and only makes the regex look more complicated. – Iron Fist Mar 23 '17 at 14:23

akash karothiya · Answer 2 · 2017-03-23T13:51:55.087

You don't need a regex for this; you can use split() instead:

>>> data = '''2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if'''
>>> print(data.split()[5])
www.hlowdolax.top

Explanation:

split() with no arguments splits on any run of whitespace, so the extra spaces need no separate clean-up. The host is then the field at index 5. Counting from the start is safer than using -4, because a line without the trailing GET part (like the last example line) has fewer fields after the host.
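
Applied to the whole block, the same idea might look like the sketch below. It assumes every line has the layout shown in the question, so that the host is the sixth whitespace-separated field. A positive index is used because the last sample line has no GET part, and counting from the end would pick a different field there. The names `text` and `hosts` are only illustrative.

```python
text = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

hosts = []
for line in text.splitlines():
    fields = line.split()   # no argument: split on any run of whitespace
    if len(fields) > 5:     # skip blank or truncated lines
        hosts.append(fields[5])

print(hosts)
```

This depends entirely on the field position, so it is only as reliable as the log format is regular.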

edited Mar 23 '17 at 13:51

answered Mar 23 '17 at 13:49

akash karothiya

What if the `-` was part of the domain name itself !? – Iron Fist Mar 23 '17 at 13:50
this code works with `-` as well :) , try it – akash karothiya Mar 23 '17 at 13:54

score 0 · Answer 3 · answered Mar 23 '17 at 13:50

Just don't capture the groups:

myRegex = r'(?:[A-Za-z0-9]{1,})\.(?:[A-Za-z0-9]{1,10})\.?(?:[A-Za-z]{1,})\.?(?:[A-Za-z]{1,})'

The ?: at the beginning of the group says "don't capture me".

And as per the docs, if there are no capturing groups, findall returns a list of the strings that matched the pattern.
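
The three behaviours of findall can be compared side by side. The sketch below uses a deliberately simplified pattern (`\w+` pieces separated by dots) rather than the question's regex, purely to keep the groups easy to read; the variable names are illustrative.

```python
import re

line = "port 80 - www.hlowdolax.top - GET /usp?f=1if"

# Several capturing groups: one tuple of group contents per match.
as_tuples = re.findall(r'(\w+)\.(\w+)\.(\w+)', line)

# Exactly one capturing group: only that group's text per match.
first_label = re.findall(r'(\w+)\.\w+\.\w+', line)

# Non-capturing groups (or no groups at all): the whole match per match.
whole_match = re.findall(r'(?:\w+)\.(?:\w+)\.(?:\w+)', line)

print(as_tuples)
print(first_label)
print(whole_match)
```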

score 0 · Answer 4 · answered Mar 23 '17 at 13:51

A solution using the str.split() and re.split() functions:

import re

s = '''
2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -
'''

result = [re.split(r'\s+', l)[5] for l in s.strip().split('\n')]

print(result)

The output:

['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 'pentsshoperqunity.top']

score 0 · Answer 5 · answered Mar 23 '17 at 14:11

If you still want to use that regex, you should retrieve the entire match each time. This can be done with the search() method of a compiled pattern: it returns a match object for the first match at or after a given position, and its group(0) is the entire match. Below is full code based on your regex.

import re

number = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

whole = re.compile(r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})")

m = whole.search(number)
output = []
while m:
    output.append(m.group(0))  # group(0) is the entire match
    # Resume right after this match; m.end() stays correct even if a domain repeats.
    m = whole.search(number, m.end())

print(output)
# ['www.hlowdolax.top', 'www.hjaoopoa.top', 'www.foolalexas.top', 'pentsshoperqunity.top']
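
The manual search loop can also be written with finditer, which yields one match object per match, so the capture groups stay available while group(0) still gives the full domain. A minimal sketch using the question's original regex; the names `text`, `pattern` and `domains` are illustrative.

```python
import re

text = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

# The question's regex, capturing groups left in place.
pattern = re.compile(
    r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})"
)

# group(0) is the whole match, regardless of how many groups the pattern has.
domains = [m.group(0) for m in pattern.finditer(text)]
print(domains)

# The individual groups are still there if they are ever needed.
first = next(pattern.finditer(text))
print(first.groups())
```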

score 0 · Answer 6 · answered Mar 23 '17 at 19:11

In your case, every domain is surrounded by '-', so try this:

number = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

re.findall(r'.*-(.*)-.*',number)
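
One thing to watch with this pattern is that the group also captures the spaces on either side of the domain. The sketch below shows two possible ways around that, assuming the separator is always " - " as in the sample: strip each result, or use a tighter pattern. The pattern `r' - (\S+) -'` and the names `loose`, `tidy` and `tight` are my own illustration, not part of the answer.

```python
import re

number = """2017-03-02:  173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02:  173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04:  173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04:  54.202.16.39 port 80 - pentsshoperqunity.top -"""

# The answer's pattern: each captured string keeps its surrounding spaces.
loose = re.findall(r'.*-(.*)-.*', number)

# Option 1: strip the whitespace afterwards.
tidy = [d.strip() for d in loose]

# Option 2 (assumed separator " - "): capture only the non-space run between dashes.
tight = re.findall(r' - (\S+) -', number)

print(repr(loose[0]))
print(tidy)
print(tight)
```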
