3

I am trying to figure out the best regex to match only the last two dot-separated parts of a URL.

For instance with www.stackoverflow.com I just want to match stackoverflow.com

The issue I have is that some strings can contain a large number of periods. For instance,

a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com 

should also return only yimg.com

The set of URLs I am working with does not have any path information, so one can assume the last part of the string is always .org or .com or something of that nature.

What regular expression will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com under the conditions above?

laxonline
user7980

4 Answers

3

You don't have to use a regex; you can use a simple explode call instead.

You want to split the host name at the periods, so something like

$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);

Then you need the last two elements, which you can echo from the resulting array.

// echoes the second-to-last element, yimg
echo $url_split[count($url_split)-2];
// echoes the period
echo ".";
// echoes the last element, com
echo $url_split[count($url_split)-1];

So in the end you'll get yimg.com as the final output.
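
For readers not using PHP, the same split-and-take-the-last-two idea can be sketched in Python. The function name last_two_labels is made up for this illustration, and the sketch assumes the input is a bare host name with no scheme or path, as the question states.

```python
def last_two_labels(host: str) -> str:
    """Return the last two dot-separated labels of a bare host name."""
    labels = host.split(".")
    # Slicing with -2 keeps the final two labels; a host with fewer
    # than two labels comes back unchanged.
    return ".".join(labels[-2:])


print(last_two_labels("www.stackoverflow.com"))
print(last_two_labels("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
```

Slicing avoids the index arithmetic of count(...)-2 and does not fail on a host that has only one label.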

Hope this helps.

Charles
1

If you need a Perl-compatible regular expression that will work in a number of languages, you can use something like the following (the example is in PHP):

$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";

preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);

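This regex matches the last label of the host name plus a top-level domain of two or three letters; a longer TLD such as .info will not match, so widen the {2,3} quantifier if you need one. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces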

yimg.com

as output, and with www.stackoverflow.com (with or without the leading www.) it gives you

stackoverflow.com

as a result
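
A rough Python equivalent of the PHP snippet above, as a sketch only. The helper name match_domain is hypothetical, and the hyphen has been moved to the end of the character class, which should match the same characters as the original class.

```python
import re

# Last label, a literal dot, then a 2-3 letter TLD anchored at the end.
PATTERN = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z]{2,3}$")


def match_domain(host):
    """Return the matched domain, or None when the pattern does not match."""
    m = PATTERN.search(host)
    return m.group(0) if m else None


print(match_domain("www.stackoverflow.com"))
print(match_domain("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))
# A four-letter TLD falls outside {2,3}, so this call returns None.
print(match_domain("example.info"))
```

Checking for None matters here: unlike the split approach, the regex can fail to match at all.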

akhilless
1

I don't know what you have tried so far, but I can offer the following solution:

/.*?([\w]+\.[\w]+)$/

There are a couple of tricks here:

  1. Use $ to anchor the match at the end of the string. This way the regex engine cannot settle for a match near the beginning.

  2. Use a capture group (...). The group means: match a word of at least one character, then a dot (backslash-escaped, because a dot has a special meaning in regex and we want a literal one), then another word of at least one character.

  3. Use a reluctant (lazy) quantifier at the beginning of the pattern, because otherwise .* consumes as much as it can. For example, if your text is:

    abc.def.gh

the greedy match will give f.gh in your group, which is not what you want.

I assumed that the host contains only word characters (\w matches letters, digits, and underscores, but not hyphens, so your real data may need a broader character class).

Here is a working Groovy example. You didn't specify which language you use, but the regex engine should behave similarly.
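
The greedy versus reluctant difference, and the hyphen caveat, can be checked with a short Python sketch. The variable names and the host www.my-site.com are invented for this illustration.

```python
import re

text = "abc.def.gh"

# Greedy: .* grabs everything, then gives back only as much as needed.
greedy = re.search(r".*(\w+\.\w+)$", text).group(1)
# Reluctant: .*? grows one character at a time, so the group starts early.
lazy = re.search(r".*?(\w+\.\w+)$", text).group(1)

print(greedy)  # f.gh
print(lazy)    # def.gh

# \w does not match a hyphen, so a hyphenated label gets cut short.
hyphen_host = "www.my-site.com"
cut = re.search(r".*?(\w+\.\w+)$", hyphen_host).group(1)
full = re.search(r".*?([\w-]+\.[\w-]+)$", hyphen_host).group(1)

print(cut)   # site.com
print(full)  # my-site.com
```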

def s = "abc.def.gh"
def m = s =~ /.*?([\w]+\.[\w]+)$/
println m[0][1] // prints the first (and only) capture group: def.gh

Hope this helps

Mark Bramnik
0

A shorter version (note that the match includes a leading dot, e.g. .yimg.com):

/(\.[^\.]+){2}$/
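
A quick Python check of how this pattern behaves, offered as a sketch. The names SHORT and NO_DOT are made up here, and NO_DOT is a suggested variant, not part of the answer above.

```python
import re

# The pattern from the answer: two ".label" groups anchored at the end.
SHORT = re.compile(r"(\.[^.]+){2}$")
# Assumed variant: label, literal dot, label, with no leading dot required.
NO_DOT = re.compile(r"[^.]+\.[^.]+$")

host = "www.stackoverflow.com"

with_dot = SHORT.search(host).group(0)
stripped = with_dot.lstrip(".")
bare = SHORT.search("stackoverflow.com")
variant = NO_DOT.search(host).group(0)
variant_bare = NO_DOT.search("stackoverflow.com").group(0)

print(with_dot)      # .stackoverflow.com
print(stripped)      # stackoverflow.com
print(bare)          # None, because the input has only one dot
print(variant)       # stackoverflow.com
print(variant_bare)  # stackoverflow.com
```

The variant also handles a host that is already just two labels, which the two-dot pattern cannot match.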
nachocab