4

I'm trying to form a regular expression for REGEXP_SUBSTR (Redshift) which will extract the sub-domain & domain part from any given URL.

I tried many suggestions from stackoverflow: regular-expression-extract-subdomain-domain, getting-parts-of-a-url-regex, how-to-get-domain-name-from-url and etc. Some of them work on regex validator but don’t work on Redshift.

A regular expression should handle URLs with and without http/https prefix.

Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?

Bikee
  • 1,197
  • 8
  • 21
Pavel D
  • 41
  • 1
  • 2
  • Check this UDF blog post: https://aws.amazon.com/blogs/aws/user-defined-functions-for-amazon-redshift/ – Guy Apr 26 '16 at 04:59

1 Answers1

5

After a ton of experimentation, this is what I use:

REPLACE(REGEXP_SUBSTR(url,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//','')

Need to match the double slash and then remove it with REPLACE because of the quite basic regex supported by Redshift.

FWIW, you'll notice that this is very different from the regex provided by Jeff Barr in the Redshift UDF's intro - that regex produces nothing for me.

Joe Harris
  • 13,671
  • 4
  • 47
  • 54