6

I was wondering if someone out there could help me with a regex in C#. I think it's fairly simple, but I've been racking my brain over it and I'm not quite sure why I'm having such a hard time. :)

I've found a few examples around but I can't seem to manipulate them to do what I need.

I just need to match ANY alphanumeric+dashes subdomain string that is not "www", and just up to the "."

Also, ideally, if someone were to type "www.subdomain.domain.com" I would like the www to be ignored if possible. If not, it's not a huge issue.

In other words, I would like to match:

  • (test).domain.com
  • (test2).domain.com
  • (wwwasdf).domain.com
  • (asdfwww).domain.com
  • (w).domain.com
  • (wwwwww).domain.com
  • (asfd-12345-www-bananas).domain.com
  • www.(subdomain).domain.com

And I don't want to match:

  • (www).domain.com

It seems to me like it should be easy, but I'm having trouble with the "not match" part.

For what it's worth, this is for use in the IIS 7 URL Rewrite Module, to rewrite for all non-www subdomains.

Thanks!

trnelson
  • 3
    Do you know how to **match exactly** all the strings you **don't** want? Hint: if you do, your problem is solved, just invert your boolean logic around the match. – Mat Aug 17 '11 at 21:01

7 Answers

9

Is the remainder of the domain name constant, like .domain.com, as in your examples? Try this:

\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)

Explanation:

  • \w+(?:-\w+)* matches a generic domain-name component as you described (but a little more rigorously).

  • (?=\.domain\.com\b) makes sure it's the first subdomain (i.e., the last one before the actual domain name).

  • \b(?!www\.) makes sure it isn't "www." (without the \b, the regex could skip past the first w and match just the remaining "ww").

In my tests, this regex matches precisely the parts you highlighted in your examples, and does not match the www. in either of the last two examples.
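
As a quick check, here is a minimal sketch in Python that runs the pattern above against hosts from the question. Python's re module is used only for illustration, on the assumption that it agrees with the .NET engine for these inputs; find_subdomain is a hypothetical helper name.

```python
import re

# Pattern copied from the answer; "domain.com" is the placeholder domain
# used in the question.
SUBDOMAIN = re.compile(r"\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)")


def find_subdomain(host):
    """Return the captured subdomain, or None when only "www" (or nothing) precedes the domain."""
    m = SUBDOMAIN.search(host)
    return m.group(1) if m else None


if __name__ == "__main__":
    for host in [
        "test.domain.com",
        "wwwwww.domain.com",
        "asfd-12345-www-bananas.domain.com",
        "www.subdomain.domain.com",
        "www.domain.com",
    ]:
        print(host, "->", find_subdomain(host))
```

In this sketch only www.domain.com yields None; for www.subdomain.domain.com the captured group is "subdomain".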


EDIT: Here's another version which matches the whole name, capturing the pieces in different groups:

^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$

In most cases, group $1 will contain an empty string because there's nothing before the subdomain name, but here's how it breaks down www.subdomain.domain.com:

$1: "www."
$2: "subdomain"
$3: ".domain.com"
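
The group breakdown can be reproduced with a small Python sketch. It assumes Python's re behaves like the .NET engine for this pattern, and split_host is a hypothetical helper name.

```python
import re

# Whole-name pattern from the edit above: prefix labels, subdomain, fixed domain.
FULL = re.compile(r"^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$")


def split_host(host):
    """Return the three captured groups as a tuple, or None if the host does not match."""
    m = FULL.match(host)
    return m.groups() if m else None


if __name__ == "__main__":
    print(split_host("www.subdomain.domain.com"))
    print(split_host("test.domain.com"))
    print(split_host("www.domain.com"))
```

The second group is the one to reference in a replacement when only the subdomain is needed.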
Alan Moore
  • I think you need to swap the ?= for a ?: , then it works in http://www.regexplanet.com/simple/index.html . I'm not familiar with ?= - maybe it's only supported on some engines. If you also use ^ and $ at start and end, then it's a better answer than mine.. – laher Aug 18 '11 at 01:30
  • 1
    I'm not trying to match the whole URL, just the portions that were highlighted in the examples. That's why I used a [lookahead](http://www.regular-expressions.info/lookaround.html) there, and why I *didn't* use the anchors (`^` and `$`). – Alan Moore Aug 18 '11 at 02:26
  • Alan, thanks for your reply. This seems to work really well for the match, but I'm not sure how to group the subdomain for a replacement. My regex is a bit weak. Would you have any thoughts? – trnelson Aug 18 '11 at 02:27
  • I actually ended up going with this and it seems to be working fine except for www.sub.domain.com, which is fine for now. Thank you so much for your guidance!! ^\b(?!www\.)(\w+(?:-\w+)*)(?:\.example\.com)$ – trnelson Aug 18 '11 at 02:44
  • 1
    @trnelson: Check out my edit; you might find that regex easier to work with. – Alan Moore Aug 18 '11 at 03:35
  • Thanks! Just crawled into bed but will take a look at this tomorrow. Really appreciate your help! – trnelson Aug 18 '11 at 05:34
2
^www\.

And invert the logic for this bit, so if it matches, then your string does not meet your requirements.
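
A minimal Python sketch of the inverted check, for illustration only; is_non_www is a hypothetical name. Note that, as written, this also rejects hosts such as www.subdomain.domain.com, which the question would ideally accept.

```python
import re

# A host "passes" when it does NOT start with "www.".
WWW_PREFIX = re.compile(r"^www\.")


def is_non_www(host):
    """Return True when the host does not begin with the literal prefix "www."."""
    return WWW_PREFIX.search(host) is None


if __name__ == "__main__":
    for host in ["test.domain.com", "wwwasdf.domain.com", "www.domain.com"]:
        print(host, "->", is_non_www(host))
```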

mopsled
2

This works:

^(?!www\.domain\.com)(?:[a-z\-\.]+\.domain\.com)$

Or, with the necessary backslashes for Java (or C#?) strings:

"^(?!www\\.domain\\.com)(?:[a-z\\-\\.]+\\.domain\\.com)$"

There may be a more concise way (i.e. only typing domain.com once), but this works ..
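
One way to type the domain only once is to build the pattern from a variable. The Python sketch below does that; it is an illustration, not a tested C# solution. The names DOMAIN, LETTERS_ONLY, WITH_DIGITS and accepts are hypothetical, and WITH_DIGITS is an assumed variant that adds 0-9 to the character class, since the class as posted contains no digits.

```python
import re

# Placeholder domain from the question, escaped once and reused.
DOMAIN = re.escape("domain.com")

# The pattern from this answer, built from DOMAIN.
LETTERS_ONLY = re.compile(rf"^(?!www\.{DOMAIN})(?:[a-z\-\.]+\.{DOMAIN})$")

# Assumed variant (not from the answer): the class also admits digits.
WITH_DIGITS = re.compile(rf"^(?!www\.{DOMAIN})(?:[a-z0-9\-\.]+\.{DOMAIN})$")


def accepts(pattern, host):
    """Return True when the compiled pattern matches the whole host."""
    return pattern.match(host) is not None


if __name__ == "__main__":
    for host in ["test.domain.com", "test2.domain.com", "www.domain.com"]:
        print(host, accepts(LETTERS_ONLY, host), accepts(WITH_DIGITS, host))
```

In this sketch test2.domain.com is accepted only by the variant with digits, and www.domain.com is rejected by both.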

laher
1

Just substitute the original with everything after the www, if present (pseudocode):

s = re.sub(r"^(www\.)?(.+)", r"\2", s)

Or if you just want to match those which are "wrong" use this:

(www\.([^.]+)\.([^.]+))

And if you must match all those which are good use this:

(([^w]|w[^w]|ww[^w]|www[^.]|www\.([^.]+)\.([^.]+)\.).+)
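
The substitution idea at the top of this answer can be sketched as runnable Python. Raw strings are used so that the backslashes reach the regex engine unchanged; strip_www is a hypothetical helper name.

```python
import re


def strip_www(host):
    """Drop a leading "www." if present and return the rest of the host."""
    # Group 1 is the optional prefix, group 2 is everything after it.
    return re.sub(r"^(www\.)?(.+)$", r"\2", host)


if __name__ == "__main__":
    for host in ["www.subdomain.domain.com", "test.domain.com", "wwwasdf.domain.com"]:
        print(host, "->", strip_www(host))
```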
orlp
1

Just thinking aloud here:

^(?:www\.)?([^\.]+)\.([^\.]+)\.

where...

  • (?:www\.)? looks for a possible "www" at the start, non-capturing
  • ([^\.]+)\. looks for the sub-domain (anything except a dot at least once until a dot)
  • ([^\.]+)\. looks for the domain, ending with a dot (anything except a dot at least once until a dot)

Note: This expression will not work with double sub-domains: www.subsub.sub.domain.com
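
A small Python sketch of how the two groups come out, assuming Python's re matches the target engine here; labels is a hypothetical helper name. It also shows that www.domain.com still matches, with "www" landing in the first group, so that case needs separate handling.

```python
import re

# Optional "www." prefix, then two dot-terminated labels.
PATTERN = re.compile(r"^(?:www\.)?([^\.]+)\.([^\.]+)\.")


def labels(host):
    """Return (subdomain, domain) as captured by the pattern, or None if it does not match."""
    m = PATTERN.match(host)
    return m.groups() if m else None


if __name__ == "__main__":
    for host in [
        "test.domain.com",
        "www.subdomain.domain.com",
        "www.domain.com",
        "www.subsub.sub.domain.com",
    ]:
        print(host, "->", labels(host))
```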

John McDonald
  • Kind of running with this idea, this is what I've come up with so far: (?:www\.)?(?:www)?([\w\-]*)\.example\.com This seems to work for: www.subdomain.example.com. But it still doesn't seem to work for: www.example.com. – trnelson Aug 17 '11 at 22:33
1

This:

^(?:www\.)?([^.]*)

It matches exactly what you put in parentheses in your question; you will find the result in group(1). The pattern has to be anchored to the beginning of the line. If you want everything in the URL except the "www.", use this:

^(?:www\.)?(.*)

One example you did not include in your test cases was "alpha.subdomain.domain.com". If you need to match everything before the "domain.com" part of the string, except a leading "www.", use this:

^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})

It will solve all of your cases, but in addition, will also return "alpha.subdomain" from my additional test case. And, for an encore, places ".domain.com" in group 2 and will not match beyond that if there are directories or parameters in the url.
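
A Python sketch of that last pattern, for illustration; split_url is a hypothetical helper name and the sample URLs are made up. It assumes the path and query contain no dots.

```python
import re

# Optional "www.", then the subdomain part (group 1), then exactly two
# dot-prefixed labels (group 2) that stop at ".", "/" or "?".
HOST = re.compile(r"^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})")


def split_url(url):
    """Return (subdomain part, domain part) or None if the pattern does not match."""
    m = HOST.match(url)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    for url in [
        "alpha.subdomain.domain.com",
        "www.subdomain.domain.com",
        "test.domain.com/dir?x=1",
    ]:
        print(url, "->", split_url(url))
```

In this sketch www.domain.com still matches, with "www" in group 1, so rejecting it would need a separate check.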

I verified all of these responses here.

Finally, for the sake of overkill, if you want to reject addresses that begin with "www.", you can use negative lookbehind:

^....(?<!www\.).*
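
The lookbehind version can be tried with the following Python sketch; is_not_www is a hypothetical name. Because the pattern consumes exactly four characters before the lookbehind, inputs shorter than four characters never match.

```python
import re

# Consume four characters, then assert they were not "www.".
NOT_WWW = re.compile(r"^....(?<!www\.).*")


def is_not_www(host):
    """Return True when the host is at least four characters long and does not start with "www."."""
    return NOT_WWW.match(host) is not None


if __name__ == "__main__":
    for host in ["test.domain.com", "wwwasdf.domain.com", "www.domain.com"]:
        print(host, "->", is_not_www(host))
```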
Michael Hays
0

Thought I'd share this.

(\.[A-Za-z]{2,3}){1,2}$

Replacing the match with an empty string removes a trailing '.com.au', '.co.uk' or similar suffix from the end. Then you can do an additional lookup to detect whether a URL contains a subdomain.

E.g.

subdomain1.sitea.com.au
subdomain2.siteb.co.uk
subdomain3.sitec.net.au

all become:

subdomain1.sitea
subdomain2.siteb
subdomain3.sitec
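
A minimal Python sketch of the stripping step, using the pattern above written as a Python raw string with the letter class spelled out as [A-Za-z]; strip_suffix is a hypothetical helper name.

```python
import re

# One or two trailing labels of two or three letters each, e.g. ".com.au".
SUFFIX = re.compile(r"(\.[A-Za-z]{2,3}){1,2}$")


def strip_suffix(host):
    """Remove the trailing suffix matched by SUFFIX, if any."""
    return SUFFIX.sub("", host)


if __name__ == "__main__":
    for host in [
        "subdomain1.sitea.com.au",
        "subdomain2.siteb.co.uk",
        "subdomain3.sitec.net.au",
    ]:
        print(host, "->", strip_suffix(host))
```

One caveat with this approach: a site name of only two or three letters is indistinguishable from a suffix label, so in this sketch sub.abc.com is reduced to "sub".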

Baby Groot
Mathew