0

I saw an SO answer here which has the SQL to extract domain from a URL in Redshift. I am very new to Regex. Is it possible to understand the answer step by step?

REPLACE(REGEXP_SUBSTR(url,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+'),'//','')

All I have come to understand so far is ^ means that to match the beginning of a string and anything between square brackets [] is a character set, but I want to understand this regex in and out.

ab_padfoot
  • 63
  • 1
  • 10

1 Answers1

0

Certainly! Let's break down the regular expression step by step:

REGEXP_SUBSTR(url,'//[^/\\\,=@\\+]+\\.[^/:;,\\\\\(\\)]+')

This part of the regular expression is used to extract a portion of the url string. It searches for a pattern starting with // followed by one or more characters that are not /, \, ,, =, @, +, :, ;, (, ), or a dot (.). The [^...] construct represents a negated character class, meaning any character that is not within the square brackets.

For example, given the input https://www.example.com/path, this expression would match //www.example.com.

 REPLACE(matchedString, '//', '')

This part of the regular expression is used to replace the // substring in the matchedString (output of the previous step) with an empty string. It effectively removes the // from the extracted portion of the URL.

Continuing with the previous example, the output would be www.example.com, as the // is replaced with an empty string.

Therefore, when this regular expression is applied to a URL, it extracts the domain name (excluding any protocol or path) and removes any leading // if present.

mario ruiz
  • 880
  • 1
  • 11
  • 28
  • Just to confirm I have understood the break down correctly, the first [^..] matches the pattern till the first dot is found. So, https://www.example.com/path would fetch //www. How is that using just one dot its able to fetch the entire pattern with multiple dots (//www.example.com) – ab_padfoot Jun 22 '23 at 04:58