I'm having trouble with a regexp function in Hive. I want to extract the top-level domain name from hostnames like these examples:

www.whatever.com
www.iam.com.uk
mobile.who.com.us

In this case, I should get "whatever", "iam", and "who". So I chose to match from the end of the string, and wrote this regular expression:

*\.([a-z]+)\.([a-z]+)+(\.[a-z]+)?$

meaning that I only want to look at the last two or three dot-separated parts of the URL. But it returned a "dangling" error. Any help is appreciated!

Qi Nie
  • Do you mean `^` instead of `*` for the first character in the regex? `^` means start-of-string. Or maybe you mean `.*` to match any number of characters before the first `.`? Also you have an unclosed group. – Rusty Shackleford Jul 22 '15 at 01:54
  • I tried putting `.*` at the beginning, but that returns a URL like 'm.whatever.com', which means it only replaces one character instead of any number of characters at the front – Qi Nie Jul 22 '15 at 02:01
  • `^\w+\.(\w+)(?:\.\w+)+$` will give you the expected result in group 1 for the examples you provided. However, it is not a general regex for extracting the top-level domain from a hostname. – Rusty Shackleford Jul 22 '15 at 02:10
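
The two points raised in these comments can be checked with a short sketch. It uses Python's `re` module as a stand-in for Hive's Java regex engine, so error messages and some details may differ; the helper names `compiles` and `second_label` and the four-label hostname are illustrative assumptions, not part of the original thread.

```python
import re

# Pattern from the question: it starts with a bare `*`, which has nothing to repeat.
BROKEN = r"*\.([a-z]+)\.([a-z]+)+(\.[a-z]+)?$"
# Pattern suggested in the comment above.
SUGGESTED = r"^\w+\.(\w+)(?:\.\w+)+$"


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def second_label(host):
    """Return group 1 of SUGGESTED, i.e. the label after the first dot."""
    m = re.match(SUGGESTED, host)
    return m.group(1) if m else None


hosts = ["www.whatever.com", "www.iam.com.uk", "mobile.who.com.us"]
print(compiles(BROKEN))                    # the leading `*` is rejected
print([second_label(h) for h in hosts])    # labels for the three examples
# Hypothetical four-label hostname: the pattern still takes the second label.
print(second_label("mobile.sports.whatever.com"))
```

The last call shows why the pattern is not general: it always captures the label after the first dot, whatever comes after it.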

1 Answer

You seem to want to get the 2nd portion of the URL always. So why not do that directly?

[a-z]+?\.([a-z]+)?\.
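  1. `[a-z]+?` lazily matches one or more lowercase letters.
  2. `\.` then matches a literal dot.
  3. `([a-z]+)?\.` then captures the letters up to the next dot in group 1. (The group as written is optional and greedy rather than lazy; it still stops at the dot because `[a-z]` cannot match one.)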
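
As a rough check of the steps above, here is a minimal sketch that runs the pattern against the hostnames from the question. It assumes Python's `re` module in place of Hive's regex functions, and the helper name `label_after_first_dot` and the four-label hostname are hypothetical additions for illustration.

```python
import re

# The pattern from this answer, unanchored, so search() is used instead of match().
PATTERN = re.compile(r"[a-z]+?\.([a-z]+)?\.")


def label_after_first_dot(host):
    """Return group 1 of the first match, or None if nothing matches."""
    m = PATTERN.search(host)
    return m.group(1) if m else None


for host in ["www.whatever.com", "www.iam.com.uk", "mobile.who.com.us"]:
    print(host, "->", label_after_first_dot(host))

# Hypothetical hostname with an extra subdomain: the capture is the second
# label, not the label just before the suffix.
print(label_after_first_dot("mobile.sports.whatever.com"))
```

Because the pattern is tied to the position of the first dot, it gives the wanted label only when exactly one subdomain precedes it.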

Codebender
  • The reason I didn't do that is that a hostname like "mobile.sports.whatever.com" could occur – Qi Nie Jul 22 '15 at 02:32
  • @QiNie: If your goal really is to extract the domain name (and not simply fix the error in your original regex), please see [here](http://stackoverflow.com/questions/863297/regular-expression-to-retrieve-domain-tld) and [here](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url) for ideas. – Rusty Shackleford Jul 22 '15 at 02:53