0

I'm using Hive as a database and I need a SQL method.

I want to replace an URL with the domain name. So for instance when I have http://info.myurl.com/some/path/into/the/server, I want to have myurl.com. But when I have https://myurl.com/ I also want to get myurl.com.

I've found the regexp_extract function in Hive, so I am this far:

select regexp_extract('http://info.myurl.com/some/path/into/the/server', 'http[s]?://.*?\.([^/]*)', 1) from dual;

Now the difficult part: how to know if there are 1 or 2 dots in the domain, so I know if I have to strip the first subdomain?

Sven
  • 2,343
  • 1
  • 18
  • 29

1 Answers1

0

In the end I used the following result:

select
       (case cardinality(split(url_extract_host('http://www.s3.info.test.co.uk/'), '.')) * 10 + length(reverse(split(url_extract_host('http://www.s3.info.test.co.uk/'), '.'))[1])
            when 42 then regexp_extract(url_extract_host('http://info.test.co.uk/'), '[^.]*\.([^/]*)', 1)
            when 52 then regexp_extract(url_extract_host('http://www.info.test.co.uk/'), '(?:[^.]*\.){2}([^/]*)', 1)
            when 62 then regexp_extract(url_extract_host('http://www.s3.info.test.co.uk/'), '(?:[^.]*\.){3}([^/]*)', 1)
            when 33 then regexp_extract(url_extract_host('http://info.test.com/'), '[^.]*\.([^/]*)', 1)
            when 43 then regexp_extract(url_extract_host('http://www.info.test.com/'), '(?:[^.]*\.){2}([^/]*)', 1)
            when 53 then regexp_extract(url_extract_host('http://www.s3.info.test.com/'), '(?:[^.]*\.){3}([^/]*)', 1)
            else url_extract_host('http://test.com/')
           end) as domain
from dual;

Everywhere an URL is mentioned, it should be the database field that needs to be checked. The case depends on the number of characters of the tld (2 or 3) and the number of elements between dots.

I'm not sure it is the most efficient solution, but it works great.

Sven
  • 2,343
  • 1
  • 18
  • 29