1

I am in need of assistance in writing a regex query to extract all the website addresses in a log file. Each line of the log file contains a bunch of info (IP address, protocol, bytes, requested website, etc...).

Specifically, I would like to extract anything that starts with "http://" and ends in a specific ".ENDING", where I specify ENDING = com, biz, net, tv, info. I do not care about the full URL (i.e. for http : // www.google.com/bla/page2=blablabla I simply want http://www.google.com). The harder part of this regex query is that I want it to pick up on domains that contain .com, .info or .biz as a subdomain (e.g. http : // www.google.com.MaliciousWebsite.com). Is there any way to catch the full domain instead of chopping it short at google.com in this situation?

I have never written a regex query before so I have tried to use an online reference chart (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/) but am struggling. Here is what I have so far:

"\A[http://]\Z[\.][com,info,biz,tv,net]"

*sorry for the spacing in the URLs but stackoverflow is flagging them and I can only post a max of 2 since I am new.

Thank you for the help.

UPDATED: Based on the excellent feedback from everyone so far I think it would be better to write this rule so that it picks up on everything between (http OR https) and (non-valid URL character: ?,!,@,#,$,%,^,&,*,(,),[,{,},],|,/,',",;,<,>)

This will ensure that all TLDs are grabbed and that websites such as google.com.bad.website.com are also grabbed. Here is my mockup so far:

"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"

Thanks again for all the help.

user662772
  • There are other extensions besides the ones you've listed (.gov and .edu, for example). Do you only want to capture those? – Justin Morgan - On strike Mar 16 '11 at 16:08
  • I didn't list all of them but I will be making a thorough sweep to grab all TLD extensions, as you listed gov,edu,tv,net,etc... – user662772 Mar 16 '11 at 16:32
  • That's good, but bear in mind that there are a lot of them and they change periodically. What about IP addresses or special domain names? `http://192.168.0.1` is valid, as is `http://localhost`. There are also port numbers to consider (e.g. `http://example.com:8080`); I don't know whether you want to capture those or not. IMHO you should just grab everything until the first character that's not allowed in a domain name. – Justin Morgan - On strike Mar 16 '11 at 16:59
  • That might make everything much simpler as you suggest to simply grab everything up until a character that is not allowed. so an ending of ?![\?|\=|\@|\#|\$|] should work? – user662772 Mar 16 '11 at 17:04
  • In that case, my answer should work for you. BTW - Unless you're using a regex flavor I'm not familiar with, `[]`, `\Z`, and `,` don't do what you think they do. The idea of your above example seems to be `\A(http://)(.+?)\.(com|info|biz|tv|net)`, but I'm not sure if that's where you're going with the `\Z`. – Justin Morgan - On strike Mar 16 '11 at 17:05
  • I'm not sure what flavor of regex is enabled, I will have to talk to my counterpart. I thought the data was a string so I used `\A` and `\Z`, but it may be different. I will find out later and was hoping that would be a minor change to the query. – user662772 Mar 16 '11 at 17:08
  • Regarding the `[^?=@#$|]` - that should work, yes. I'm not sure of the full list of characters that aren't allowed in a domain name, though. However, I suspect this might be overkill. `http://example.com=blah` isn't a valid URL anyway, so if you already know all the URLs in your target text are valid, you shouldn't need to worry about it. – Justin Morgan - On strike Mar 16 '11 at 17:13
  • The `\A` makes sense where you've put it, but the `\Z` doesn't. That matches the end of the string, but you have characters after it. Also, you're using `[]` where you should be using `()`, and `,` where you'd normally use `|`. What language are you using? Check out http://www.regular-expressions.info/reference.html – Justin Morgan - On strike Mar 16 '11 at 17:15
  • All the URLs will be valid, but I don't care for anything past the TLD. I checked the website you linked but I didn't see where I could determine what flavor of regex I am using. – user662772 Mar 16 '11 at 17:25
  • Sorry, the link was just meant as general regex reference. Most regex are pretty similar, and I've never seen one that uses the `[foo,bar,baz]` construct you have here. If that's what your engine uses, I suspect it isn't really regex. Also, if the URLs are all valid you shouldn't need the `=@$` etc. in there; a simple `[^?#/\s\r]` should get the job done. Edited my answer slightly. – Justin Morgan - On strike Mar 16 '11 at 17:33
  • If you're using PCRE, you definitely want the more standard syntax. – Justin Morgan - On strike Mar 16 '11 at 17:34
  • I don't understand what you mean when you say, "if they are valid you shouldn't need =@$". I updated the question description based on the answers I've received so far. As you suggested, I'd rather grab everything but have it cut off after the first non-valid URL character, i.e. http://www.google.com.bad.site.com/search? should be cut down to just http://www.google.com.bad.site.com. Does yours do that? Sorry for all the questions, but I am learning this as this post gets updated by all. – user662772 Mar 16 '11 at 17:48
  • *** non-valid url character meaning character beyond the last TLD or port number. Maybe calling it non-valid is the wrong term (in my context). – user662772 Mar 16 '11 at 17:50
  • Mine should do that, yes. But it's not looking for non-URL characters, it's looking for non-domain-name characters. It should grab the `http://domain-name` part and ignore the rest. Specifically, it's looking for either whitespace (`\s`) or the characters `/`, `?`, and `#`, which would normally signify the end of the domain-name portion inside a URL. This assumes your URLs all have whitespace after them. If there's something else at the end of your URLs, replace the `\s` with whatever that is. – Justin Morgan - On strike Mar 16 '11 at 19:09

6 Answers

0

Not sure what regex language you're using, so I'll go with .NET syntax. How about:

@"^https?://[^?/#\s\r]+"

It's not perfect, but the real spec for domain names is a beast, and the presence of http:// or https:// should be enough to tell you there's a domain name on the way.

The ? and # inside the character class should be fine, but I haven't had a chance to check it. You might need to escape them with a \.

Also, this will capture port numbers as well. If you don't want that, add : to the negated character class.


Edit: The PCRE version should be something like this:

^https?:\/\/[^?\/#\s\r]+

I haven't used PCRE recently, though, so you might want to check that with someone who has. I'm not sure which characters need to be escaped inside a character class in PCRE.
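As a rough sketch of how the pattern above behaves, here is a Python version run against a made-up log line. The log format, the sample line, and the helper name `extract_hosts` are assumptions for illustration only. The leading `^` is dropped here because the question says each log line carries other fields (IP address, protocol, and so on) before the URL, so the URL is presumably not at the start of the line.

```python
import re

# Same idea as the pattern above, minus the ^ anchor (assumption: the URL
# is not the first thing on the log line). \s already covers \r.
WITH_PORT = re.compile(r"https?://[^?/#\s]+")
# Adding ':' to the negated class stops the match before any port number.
WITHOUT_PORT = re.compile(r"https?://[^?/#\s:]+")

def extract_hosts(line, keep_port=True):
    """Return every scheme://host prefix found in one log line."""
    pattern = WITH_PORT if keep_port else WITHOUT_PORT
    return pattern.findall(line)

# Hypothetical log line, not taken from a real log.
sample = "10.0.0.5 GET http://www.google.com.bad.site.com:8080/search?q=1 200 512"

print(extract_hosts(sample))                   # ['http://www.google.com.bad.site.com:8080']
print(extract_hosts(sample, keep_port=False))  # ['http://www.google.com.bad.site.com']
```

Because the match runs until the first `/`, `?`, `#` or whitespace, it does not depend on any list of TLDs, so a host such as `www.google.com.bad.site.com` is returned whole.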

Justin Morgan - On strike
0

You can try this expression:

\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))

and you can take a look at the description here :)

r"""
\b               # Assert position at a word boundary
(                # Match the regular expression below and capture its match into backreference number 1
   (?:              # Match the regular expression below
      http://          # Match the characters “http://” literally
   )
   (?:              # Match the regular expression below
      .                # Match any single character that is not a line break character
   )*               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
      \.               # Match the character “.” literally
   )
   (?:              # Match the regular expression below
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
         com              # Match the characters “com” literally
      |                # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         info             # Match the characters “info” literally
      |                # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
         biz              # Match the characters “biz” literally
      |                # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
         tv               # Match the characters “tv” literally
      |                # Or match regular expression number 5 below (the entire group fails if this one fails to match)
         net              # Match the characters “net” literally
   )
)
"""
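One consequence of the greedy `(?:.)*` described above is worth checking: it runs to the end of the line and then backs up to the last `.com`, `.net`, etc. it can find. The Python sketch below compiles the pattern unchanged and tries it on two invented log lines; the lines and the helper name `first_match` are assumptions for illustration.

```python
import re

# The pattern from the answer above, compiled unchanged.
GREEDY = re.compile(r"\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))")

def first_match(line):
    """Return the captured text for the first match in a line, or None."""
    found = GREEDY.search(line)
    return found.group(1) if found else None

# Hypothetical log lines, invented for this sketch.
disguised = "10.0.0.5 GET http://www.google.com.MaliciousWebsite.com/bla 200"
redirect = "10.0.0.5 GET http://www.google.com/redirect?u=evil.net 200"

print(first_match(disguised))  # http://www.google.com.MaliciousWebsite.com
print(first_match(redirect))   # http://www.google.com/redirect?u=evil.net
```

The first line gives the full disguised domain, which is what the question asks for. The second shows the cost: when the path or query string also contains one of the listed endings, the match extends past the host.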
SubniC
0

This will catch http or https followed by :// and a domain name not containing a space or slash.
Note that there are several flavors of regex across programming languages. You may need to escape the / as \/, and in Java you have to double each \ as \\.

https?://[^ /]+\.(?:com|info|biz|tv|net)
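A quick way to see what this pattern does is to run it in Python, where `/` needs no escaping. The sketch below uses invented log lines, and the second pattern (`ANY_WHITESPACE`) is a variation that is not part of the answer: it assumes the log might be tab-separated, in which case excluding only the space character is not enough.

```python
import re

# The pattern from the answer above, unchanged.
SPACE_ONLY = re.compile(r"https?://[^ /]+\.(?:com|info|biz|tv|net)")
# Variation (not from the answer): exclude any whitespace, not just spaces.
ANY_WHITESPACE = re.compile(r"https?://[^\s/]+\.(?:com|info|biz|tv|net)")

# Hypothetical log lines: one space-separated, one tab-separated.
disguised = "10.0.0.5 GET https://www.google.com.MaliciousWebsite.com/bla/page2 200"
tabbed = "10.0.0.5\thttp://a.example.com\t200\tcdn.example.net"

print(SPACE_ONLY.findall(disguised))   # ['https://www.google.com.MaliciousWebsite.com']
print(SPACE_ONLY.findall(tabbed))      # one match that runs across the tabs
print(ANY_WHITESPACE.findall(tabbed))  # ['http://a.example.com']
```

Since `[^ /]+` is greedy, the match extends to the last listed ending before the first slash, so the disguised domain is captured in full.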
bw_üezi
0
^http\:\/\/(.+)\.(com|info|biz|tv|net)

will catch all domains in the http realm ending in the specified TLD, but it will also match things like http://test.commercial.ly. I didn't add an ending slash since I'm not sure whether you will always have an ending slash on the domain; if you always do, you can simply add a / to the end of the regex. If you don't always have an ending slash, that could give you some false positives. You could also add https support if you wanted. Are you sure you want to specify the TLDs, or would you want to grab any TLD?
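The `commercial.ly` case can be reproduced in Python. The sketch below first runs the pattern as written, then a stricter variation that is an assumption of this example rather than part of the answer: it keeps the match inside the host and uses a lookahead so the ending only counts when the host stops right after it. The sample URLs and the helper name `match_text` are hypothetical.

```python
import re

# The answer's pattern, unchanged (Python accepts the escaped ":" and "/").
ORIGINAL = re.compile(r"^http\:\/\/(.+)\.(com|info|biz|tv|net)")
# Variation (assumption, not from the answer): stay inside the host and require
# that the ending is followed by "/", ":", "?", "#", whitespace, or end of input.
STRICT = re.compile(r"^http://[^/\s:?#]+\.(?:com|info|biz|tv|net)(?=[/:?#\s]|$)")

def match_text(pattern, url):
    """Return the full matched text, or None when the pattern does not match."""
    found = pattern.search(url)
    return found.group(0) if found else None

# Hypothetical inputs.
false_positive = "http://test.commercial.ly/page"
disguised = "http://www.google.com.malware.badguywebsite.info/login"

print(match_text(ORIGINAL, false_positive))  # http://test.com
print(match_text(STRICT, false_positive))    # None
print(match_text(STRICT, disguised))         # http://www.google.com.malware.badguywebsite.info
```

The lookahead replaces the optional trailing slash discussed above: it accepts a host that ends the input as well as one followed by a path, port, or query.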

Francis Lewis
  • I would prefer to grab ANY tlds. I didn't think there was a way to express that so I thought I would have to enter them all manually. – user662772 Mar 16 '11 at 17:00
  • Something like ^http\:\/\/(.+)\.([a-z]{2,4})/ would grab all domains with any TLD. Using [a-z]{2,4} selects any characters from a-z with a length of 2-4 characters. I'm not sure if there are any TLDs longer than 4 characters, but if there are, just adjust that part. – Francis Lewis Mar 16 '11 at 17:34
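On the length question raised in the last comment: TLDs longer than four letters do exist (`.museum` and `.travel`, for example), so the `{2,4}` bound will skip them. A small Python sketch of the difference follows; the second pattern is a variation assumed for this example, with the upper bound removed and the host limited to non-slash, non-whitespace characters, and the sample URLs are hypothetical.

```python
import re

# The comment's pattern, unchanged: a 2-4 letter TLD followed by a slash.
SHORT_TLD = re.compile(r"^http\:\/\/(.+)\.([a-z]{2,4})/")
# Variation (assumption): no upper bound on the TLD length, host kept to
# characters that are neither "/" nor whitespace.
ANY_TLD = re.compile(r"^http://([^/\s]+)\.([a-z]{2,})/")

def tld_of(pattern, url):
    """Return the captured TLD, or None when the pattern does not match."""
    found = pattern.search(url)
    return found.group(2) if found else None

print(tld_of(SHORT_TLD, "http://www.example.com/"))     # com
print(tld_of(SHORT_TLD, "http://www.example.museum/"))  # None
print(tld_of(ANY_TLD, "http://www.example.museum/"))    # museum
```

Both patterns still require a slash after the host, as in the comment, so a URL with nothing after the domain would not match either of them.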
0

\A[http://]\Z[\.][.*][com,info,biz,tv,net]?![\.]

Not sure what type of regex you're using, but it would seem that you're trying to find the point of an address that includes BOTH ".com, net,etc." AND "/", or more specific might be: ends in .com and does NOT precede another '.'

So .com.com isn't valid, but .com/, or .com would be

Dawson
  • Yes, the whole point is to extract the requested domain. However, I have seen domain requests that are disguised to confuse users into believing it is a legitimate site by using http://www.google.com.malware.badguywebsite.info. I want to capture the whole string and not have it see www.google.com and cut off the rest of the domain. Does that make sense? – user662772 Mar 16 '11 at 16:52
  • Ahh I think I see what you mean. Correct, a .com alone would not be a good enough rule. It needs to check to make sure it is .com AND NOT followed by [.] again. – user662772 Mar 16 '11 at 16:55
0

Umm hello user662772:

Okay, I'm not trying to be snarky, but have you considered using awk? It will split your log file up into fields, and then you can simply print the field you are after. Bonus: awk does regular expression pattern matching and substitution.

But you were asking about regexes:

I'm using Perl's regular expressions:

http.*(\.com|\.org|\.net)

Whoops, had to double-escape the backslashes.
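The field-splitting idea mentioned above can also be sketched without awk. The Python version below is only an illustration under stated assumptions: it assumes the log fields are separated by whitespace and that the requested website appears as a field starting with `http://` or `https://`. The function name `requested_sites` and the sample line are hypothetical, and the standard library's `urllib.parse.urlsplit` is used in place of a hand-written regex to cut the URL down to scheme and host.

```python
from urllib.parse import urlsplit

def requested_sites(lines):
    """Yield scheme://host[:port] for each whitespace-separated field that looks like a URL."""
    for line in lines:
        for field in line.split():
            if field.startswith(("http://", "https://")):
                parts = urlsplit(field)
                # netloc is the host plus an optional port; path and query are dropped.
                yield f"{parts.scheme}://{parts.netloc}"

# Hypothetical log line, not taken from a real log.
log = ["10.0.0.5 GET http://www.google.com.bad.site.com:8080/search?q=1 200 512"]

print(list(requested_sites(log)))  # ['http://www.google.com.bad.site.com:8080']
```

If the port should be dropped as well, `parts.hostname` could be used instead of `parts.netloc`; note that `hostname` is returned in lower case.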

gymnodemi
  • I do have access to Perl queries, so I am open to either. I am not familiar with either, so I do not know which would be easier. As such, I simply started researching regex. Can you provide it in Perl using awk? – user662772 Mar 16 '11 at 17:03