I need help writing a regex to extract all the website addresses from a log file. Each line of the log file contains several fields (IP address, protocol, bytes, requested website, etc.).
Specifically, I would like to extract anything that starts with "http://" and ends in a specific ".ENDING", where ENDING is one of com, biz, net, tv, or info. I do not care about the full URL (i.e., for http : // www.google.com/bla/page2=blablabla I only want http://www.google.com). The harder part is that I want the regex to pick up domains that contain .com, .info, or .biz as a subdomain (e.g., http : // www.google.com.MaliciousWebsite.com). Is there any way to catch the full domain in this situation instead of cutting it short at google.com?
I have never written a regex before, so I tried to use an online reference chart (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/), but I am struggling. Here is what I have so far:
"\A[http://]\Z[\.][com,info,biz,tv,net]"
*Sorry for the spacing in the URLs, but Stack Overflow is flagging them and I can only post a maximum of 2 since I am new.
Thank you for the help.
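To make the requirement above concrete, here is a minimal Python sketch of one way it might be expressed. It is only an illustration under assumptions: the sample log line is made up (the real log format is not shown here), the names `TLDS`, `HOST_RE`, and `extract_sites` are hypothetical, and hostnames are assumed to contain only letters, digits, dots, and hyphens. The greedy host match plus the negative lookahead is what keeps the match from stopping at google.com when the domain continues.

```python
import re

# TLD endings listed in the question.
TLDS = ("com", "biz", "net", "tv", "info")

# Greedily match hostname characters, then require the host to END in one of
# the TLDs: the lookahead rejects a match if more hostname characters follow.
HOST_RE = re.compile(
    r"http://[A-Za-z0-9.-]+\.(?:" + "|".join(TLDS) + r")(?![A-Za-z0-9.-])",
    re.IGNORECASE,
)

def extract_sites(line):
    """Return every http://host found in one log line (path and query dropped)."""
    return HOST_RE.findall(line)

# Made-up log line, for illustration only.
sample = "10.0.0.1 TCP 512 http://www.google.com.MaliciousWebsite.com/bla/page2=x"
print(extract_sites(sample))
```

With this pattern a host such as www.google.community.org would not match, because after ".com" more hostname characters follow and the host does not end in a listed TLD.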
UPDATED: Based on the excellent feedback from everyone so far, I think it would be better to write this rule so that it picks up everything between (http OR https) and the first (non-valid URL character: ?,!,@,#,$,%,^,&,*,(,),[,{,},],|,/,',",;,<,>)
This will ensure that all TLDs are grabbed and that websites such as google.com.bad.website.com are also grabbed. Here is my mockup so far:
"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"
Thanks again for all the help.
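As a rough sketch of the updated rule (not a definitive answer), the idea could be written in Python as a negated character class: match http:// or https://, then consume characters until one of the excluded characters appears. The excluded set below is the list from the update, plus the comma from the mockup and whitespace, which is my own assumption since log fields are usually separated by spaces. The names `HOST_RE` and `extract_hosts` and the sample lines are hypothetical.

```python
import re

# "http://" or "https://", then one or more characters that are NOT in the
# excluded set. Whitespace (\s) is an added assumption; the rest comes from
# the list and mockup in the update.
HOST_RE = re.compile(
    r"""https?://[^\s?!@#$%^&*()\[\]{}|/'";,<>]+""",
    re.IGNORECASE,
)

def extract_hosts(line):
    """Return each scheme://host found in a log line, cut at the first excluded character."""
    return HOST_RE.findall(line)

# Made-up log lines, for illustration only.
print(extract_hosts("1.2.3.4 GET https://google.com.bad.website.com/a?b=1 200"))
print(extract_hosts("5.6.7.8 GET http://example.tv 304"))
```

Because the colon is not in the excluded set, a port such as :8080 would be kept as part of the match; add ":" to the class if that is not wanted.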