2

I have a table with column called URL. I am trying to extract different parts of URL using REGEX_SUBSTR function in Redshift.

URL                                             Expected_output
------------                                 
www.abcd.com/first-name                      |   first-name
www.abcd.com/another-name/some-details/other |   another-name
www3.abcd.com/some-name/                     |   some-name
form.abcd.com/another-first-name             |   another-first-name

So the intention is to extract the first slug after the domain name

I tried
SELECT REGEXP_SUBSTR('www.abcd.com/slug-name', '(www|www3|form)[.]abcd[.][^/]+/([^/#?]+)',1,2)

Output
www.abcd.com/slug-name

Neil
  • 1,715
  • 6
  • 30
  • 45
  • What if you use `REGEXP_SUBSTR('www.abcd.com/slug-name', '[^/]+',1,2)`? Or do you need to match only in specific context? – Wiktor Stribiżew Nov 20 '19 at 07:56
  • Please provide feedback on exact context you want a match to occur, it is important here since this regex engine does not seem to support non-capturing groups. Do you need to check for `(www|www3|form)`? – Wiktor Stribiżew Nov 20 '19 at 08:02
  • Yes, I would need to check that this is a valid URL starting with `www|www3|form` – Neil Nov 20 '19 at 08:07
  • Then I do not think you can use `REGEXP_SUBSTR` since it can only extract the *first* capturing group contents, and the others are ignored, and since it uses POSIX regex standard, no non-capturing groups are supported. You need to use a two-step solution. – Wiktor Stribiżew Nov 20 '19 at 08:10
  • Got it. By 2 step do you mean using a UDF? – Neil Nov 20 '19 at 08:13
  • No idea, I just checked the [docs](https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html) and see that the `REGEXP_SUBSTR` solution won't work for you even if you add `e` as the parameters option. – Wiktor Stribiżew Nov 20 '19 at 08:15
  • Try `REGEXP_REPLACE('www.abcd.com/slug-name', '^(www3?|form)\\.abcd\\.[^/]+/([^/#?]+).*', '\\2')` – Wiktor Stribiżew Nov 20 '19 at 08:17
  • Is `\\2` intended? It gives me syntax error? – Neil Nov 20 '19 at 08:22
  • The point is that we need a backreference to Group 2 in the replacement pattern. If `'\\2'` does not work, try `'$2'`. The docs do not contain any hint on the backreference syntax. – Wiktor Stribiżew Nov 20 '19 at 08:24
  • Understood. No. It only expects int value. Hence it doesn't work. Anyway, thanks for you help. Appreciate it. – Neil Nov 20 '19 at 08:27
  • Hm, but [this answer](https://stackoverflow.com/a/50711245/3832970) suggests `'$2'` syntax. Why `int`? Are you trying it with `REGEXP_SUBSTR`? It is `REGEXP_REPLACE` – Wiktor Stribiżew Nov 20 '19 at 08:29
  • oh sorry, you are right. I was using `REGEX_SUBSTR`. `REGEX_REPLACE` works for this case – Neil Nov 20 '19 at 08:34
  • If there is no match, REGEXP_REPLACE will keep the original string, is it OK or do you want to remove the whole string? – Wiktor Stribiżew Nov 20 '19 at 08:41
  • I guess it should be ok. Currently, I am trying it out with different data that I have and seeing if there are any other edge cases. – Neil Nov 20 '19 at 08:44

2 Answers2

1

Since you need several groupings in the pattern with the first one defining the expected match context (so, you do not need to extract the first group content), and REGEXP_SUBSTR only supports the first group extraction using the e modifier, you have to fall back to REGEXP_REPLACE.

One note: REGEXP_REPLACE removes the matches and if a string has no match, it will remain there as is. REGEXP_SUBSTR would return a blank value.

So, you may consider

REGEXP_REPLACE('www.abcd.com/slug-name', '^(www3?|form)\\.abcd\\.[^/]+/([^/#?]+).*', '$2')

Or, to remove the no match entries text (note the |.+ added at the end):

REGEXP_REPLACE('www.abcd.com/slug-name', '^(www3?|form)\\.abcd\\.[^/]+/([^/#?]+).*|.+', '$2')

So, ^(www3?|form)\\.abcd\\.[^/]+/([^/#?]+).* matches

  • ^ - start of string
  • (www3?|form) - Group 1: www and an optional 3
  • \.abcd\. - .abcd. substring
  • [^/]+ - 1+ chars other than /
  • / - a / char
  • ([^/#?]+) - Group 2 (referred to via the $2 backreference in the replacement pattern): 1 or more chars other than /, # an ?
  • .* - any 0+ chars as many as possible

The |.+ says "or any 1+ chars". That is, remove the whole text if the first alternative found no matches.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
-1

You can ignore anything up to the first forward slash, capture the group after the first and up to the second (if present). In the capture group, you want to get anything with characters (possibly not upper-case), hyphens and underscores. I found '\w-' matches both underscores and hyphens.

Try this expression:

^.*?\/([\w-]+)\/?.*

I tested this on the following strings:

www.abcd.com/first-name                     
www.abcd.com/another-name/some-details/other 
www3.abcd.com/some-name/                    
form.abcd.com/another-first-name
form.abcd.com/another_first-name
form.abcd.com/anotherfirst-name

and I used the tool on Regex 101 to give me these matches:

Match 1
Full match  0-44    www.abcd.com/first-name                     
Group 1.    13-23   first-name
Match 2
Full match  45-90   www.abcd.com/another-name/some-details/other 
Group 1.    58-70   another-name
Match 3
Full match  91-135  www3.abcd.com/some-name/                    
Group 1.    105-114 some-name
Match 4
Full match  136-168 form.abcd.com/another-first-name
Group 1.    150-168 another-first-name
Match 5
Full match  169-201 form.abcd.com/another_first-name
Group 1.    183-201 another_first-name
Match 6
Full match  202-233 form.abcd.com/anotherfirst-name
Group 1.    216-233 anotherfirst-name
  • `Invalid preceding regular expression prior to repetition operator. The error occurred while parsing the regular expression fragment: '^.*?>>>HERE>>>/([w-]+)/?'` – Neil Nov 20 '19 at 07:53
  • Some similar error https://stackoverflow.com/questions/46495580/redshift-regexp-substr – Neil Nov 20 '19 at 07:53
  • Ah, I was afraid there would be a difference in Regex versions. I'm not sure, perhaps try replacing the lazy '.*?' with '[\w.]+', I'm afraid I'm not familiar with Redshift and don't have a way to test it. – James Baker Nov 20 '19 at 07:58
  • @JamesBaker Your regex won't work with OP code. You need to include the usage example. – Wiktor Stribiżew Nov 20 '19 at 08:03