My end goal is to take a multi-line block of text that contains domains and output just a list of those domains.
Example input:
2017-03-02: 173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02: 173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04: 173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04: 54.202.16.39 port 80 - pentsshoperqunity.top -
The output I want in this case:
www.hlowdolax.top
www.hjaoopoa.top
www.foolalexas.top
pentsshoperqunity.top
Eventually I found out that the best tool for this purpose seems to be re.findall(), so I tried it this way:
matchedDomains=re.findall(myRegex, fileWithMessyText.read())
print matchedDomains
And in the output I see that it matched all the domains but the result looks like this:
[('www', 'hlowdolax', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'hjaoopoa', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('www', 'foolalexas', 'to', 'p'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('pentsshoperqunity', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('nikesportweardewvv', 't', 'o', 'p'), ('search', 'p', 'h', 'p'), ('www', 'dpooldoopl', 'a', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('www', 'sosgenerga', 'lz', 'top'), ('user', 'p', 'h', 'p'), ('1', 'g', 'i', 'f'), ('fordfocuscommunoityesz', 't', 'o', 'p'), ('search', 'p', 'h', 'p')]
If that's relevant, here is the regex I use:
([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})
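To make the behaviour reproducible, here is a minimal sketch that assumes the regex above is stored in a variable called my_regex and that two of the sample lines stand in for the file (both names are made up for illustration). As far as I can tell from the re documentation, findall returns a list of tuples, one item per capture group, whenever the pattern contains more than one group, which would explain the output I get:

```python
import re

# The regex from the question, unchanged: four capture groups.
my_regex = r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})"

# Two sample lines standing in for fileWithMessyText.read().
sample_text = (
    "2017-03-02: 173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if\n"
    "2017-03-04: 54.202.16.39 port 80 - pentsshoperqunity.top -\n"
)

# With several capture groups, each result is a tuple of the groups,
# not the full matched string.
matched = re.findall(my_regex, sample_text)
print(matched)
# [('www', 'hlowdolax', 'to', 'p'), ('pentsshoperqunity', 't', 'o', 'p')]
```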
I googled a variety of keywords, tested my regex with pythex.org, and learned about the term "match captures" and that it has something to do with "capture groups". However, all the advice I found here about using group appears not to be compatible with findall, and if I try to use search or match, it only works for the first line and prints the whole line instead of just the match. (This may look like rambling, but I didn't document my wanderings, so I don't remember exactly what I tried.) Also, intuitively it seems like a workaround to use loops and match line by line when there is a tool that matches the whole block. The problem is, I don't know how to use it.
I'm not looking for someone to write the code for me, but I'm really lost at this point. Is there a way to use findall and output just nicely formatted matches?
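For what it's worth, here is a sketch of two approaches that appear to give plain strings, assuming the same regex and the example input from the top of the question (sample_text, non_capturing and capturing are names made up for illustration). The first turns every group into a non-capturing one with (?:...), so findall returns whole matches. The second keeps the groups and reads group(0), the full match, from re.finditer, which walks over every match in the block without a line-by-line loop:

```python
import re

# The example input from the question, standing in for the file contents.
sample_text = """\
2017-03-02: 173.254.221.115 port 80 - www.hlowdolax.top - GET /usp?f=1if
2017-03-02: 173.254.221.115 port 80 - www.hjaoopoa.top - GET /uf=1if
2017-03-04: 173.254.221.115 port 80 - www.foolalexas.top - GET /userif
2017-03-04: 54.202.16.39 port 80 - pentsshoperqunity.top -
"""

# Option 1: same pattern, but every group is non-capturing, so findall
# returns the whole matched text for each match.
non_capturing = re.compile(
    r"(?:[A-Za-z0-9]{1,})\.(?:[A-Za-z0-9]{1,10})\.?(?:[A-Za-z]{1,})\.?(?:[A-Za-z]{1,})"
)
domains_findall = non_capturing.findall(sample_text)

# Option 2: original pattern with its capture groups; group(0) is the
# full match regardless of how many groups the pattern has.
capturing = re.compile(
    r"([A-Za-z0-9]{1,})\.([A-Za-z0-9]{1,10})\.?([A-Za-z]{1,})\.?([A-Za-z]{1,})"
)
domains_finditer = [m.group(0) for m in capturing.finditer(sample_text)]

# One domain per line.
print("\n".join(domains_findall))
```

On this input both lists hold the four domains in the order shown in the desired output. I have not checked the pattern against other log formats, so it may well match strings that are not domains.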