RegEx matching URLs that are NOT in my domain

Question

I am trying to set up my Netscaler device with a Rewrite Policy. One of my requirements is to replace any non-domain URLs with the home page URL... that is, I want the Netscaler to replace all external links on a page being served from behind the device with the home page's URL (ex: https://my.domain.edu). The type of Rewrite Policy I'm trying to configure uses a PCRE-compliant regex engine to find specific text on a web page (multiple matches possible).

good links:

https://your.page.domain.edu -- won't be replaced  
http://good.domain.edu  -- also won't be replaced

bad links (should be replaced with home page URL):

https://www.google.com    
http://not.the.best.example.org   
http://another.bad.example.erewhon.edu   
https://my.domain.com

I currently have this pattern:

(https?://)(?![\w.-]+\.domain\.edu)

According to the Netscaler's RegEx evaluation tool this matches the bad links above and doesn't match the good links, so it seems to be working... in fact, when I run this on a test page, the Netscaler finds all the URLs I want to replace and leaves the good URLs alone.

The problem is the Netscaler isn't replacing the URLs the way I want: it replaces the (https?://) group with the home page URL but leaves the remaining part of the bad URL. For example, it replaces http://www.google.com with: https://my.domain.eduwww.google.com
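
The behavior described above can be reproduced outside the Netscaler. The following is a minimal sketch using Python's `re` module as a stand-in (an assumption: the Netscaler's PCRE engine may behave differently). It shows that the lookahead only asserts and never consumes the host, so the match covers just the scheme. The `whole` pattern is a hypothetical variant, not something from the Netscaler documentation, that consumes the host after the lookahead.

```python
import re

home = "https://my.domain.edu"

# Pattern from the question: the negative lookahead checks the host
# but does not consume it, so only "http://" or "https://" is matched.
scheme_only = re.compile(r"(https?://)(?![\w.-]+\.domain\.edu)")
broken = scheme_only.sub(home, "http://www.google.com")
print(broken)  # https://my.domain.eduwww.google.com

# Hypothetical variant: consume the host after the lookahead so that
# scheme and host are both part of the match and get replaced together.
whole = re.compile(r"https?://(?![\w.-]+\.domain\.edu)[\w.-]+")
fixed = whole.sub(home, "http://www.google.com")
kept = whole.sub(home, "https://your.page.domain.edu")
print(fixed)  # https://my.domain.edu
print(kept)   # https://your.page.domain.edu
```

Note that `[\w.-]+` stops at the first `/`, so any path following the host would be left in place after the replacement.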

I can configure the Rewrite Policy to replace specific URLs (for example, https://www.google.com), so I know the mechanism works. Obviously, this won't work for the general case.

I've tried enclosing the entire regex in parentheses, but this didn't change anything.

Can a regular expression be written for the general case, to match the entire URL for all domains that aren't mine?

Thanks in advance for any help!

What is the source of the links you want to test and rewrite, and what exactly do you mean by **all external links on a page**? The entire web page, or only links under specific elements? Are invalid URLs also possible, such as `http://my--example....domain.org` or `http://!@#$@#$`? What about mail and FTP URLs (`mailto:`, `ftp://`)? Also, do these URLs contain only the domain root, or may a path follow? What about GET parameters or fragments in the URL (`http://example.com?params`, `http://example.com#label`)? If the URLs contain more than a domain or subdomain name, the pattern may be a bit longer, since there may be various kinds of links on the page. — Wh1T3h4Ck5, Jun 01 '18 at 12:32

Allan · Accepted Answer · 2018-06-01T01:27:05.553

You can use the following regex:

^https?:\/\/[\w.-]+(?<!\.domain\.edu)$

with your home page URL as substitution:

https://my.domain.edu

TEST INPUT:

https://www.google.com
http://not.the.best.example.org
http://another.bad.example.erewhon.edu
https://my.domain.com
https://your.page.domain.edu
http://good.domain.edu

TEST OUTPUT:

https://my.domain.edu
https://my.domain.edu
https://my.domain.edu
https://my.domain.edu
https://your.page.domain.edu
http://good.domain.edu
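
The test above can be sketched in Python's `re` module, which is an assumption standing in for regex101's PCRE flavor (and not for the Netscaler engine). The `re.MULTILINE` flag is needed here so that `^` and `$` anchor at each line of the input; the variable names are illustrative only.

```python
import re

# Regex and substitution copied from the answer. re.MULTILINE makes
# ^ and $ match at every line, with one URL per line in the input.
pattern = re.compile(r"^https?:\/\/[\w.-]+(?<!\.domain\.edu)$", re.MULTILINE)
home_page = "https://my.domain.edu"

test_input = "\n".join([
    "https://www.google.com",
    "http://not.the.best.example.org",
    "http://another.bad.example.erewhon.edu",
    "https://my.domain.com",
    "https://your.page.domain.edu",
    "http://good.domain.edu",
])

# The lookbehind rejects any host that ends in ".domain.edu".
test_output = pattern.sub(home_page, test_input)
print(test_output)
```

Because the pattern is anchored with `^` and `$`, it only matches a URL that sits alone on a line; links embedded in other page content would not match as written.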

Demo on regex101

If the scheme (http vs. https) must be preserved, then use the following regex:

^(https?:\/\/)[\w.-]+(?<!\.domain\.edu)$

with replacement:

\1my.domain.edu

INPUT:

https://www.google.com
http://not.the.best.example.org
http://another.bad.example.erewhon.edu
https://my.domain.com
https://your.page.domain.edu
http://good.domain.edu

OUTPUT:

https://my.domain.edu
http://my.domain.edu
http://my.domain.edu
https://my.domain.edu
https://your.page.domain.edu
http://good.domain.edu
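
As a rough illustration of the scheme-preserving variant, here is a small sketch in Python's `re` module (again an assumption, not the Netscaler engine). The helper name `rewrite` is hypothetical; the regex and the `\1my.domain.edu` replacement are taken from the answer.

```python
import re

# Group 1 captures the scheme so the replacement can re-insert it.
pattern = re.compile(r"^(https?:\/\/)[\w.-]+(?<!\.domain\.edu)$")

def rewrite(url: str) -> str:
    """Replace the host of a non-domain URL, keeping its original scheme."""
    return pattern.sub(r"\1my.domain.edu", url)

print(rewrite("http://not.the.best.example.org"))  # http://my.domain.edu
print(rewrite("https://my.domain.com"))            # https://my.domain.edu
print(rewrite("http://good.domain.edu"))           # http://good.domain.edu
```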

Demo2

@Wh1T3h4Ck5: you do not need to escape the dot inside a character class (example: https://regex101.com/r/xRqrjS/1). There it is interpreted as a literal dot, not as "any single character". — Allan, Jun 01 '18 at 01:29
Thanks for your answer. Although it works on regex101.com, it unfortunately doesn't work in my Netscaler device. There must be something else going on with Netscaler, and I'm beginning to suspect it really doesn't understand lookarounds. — JohnG, Jun 01 '18 at 19:25

score 0 · Answer 2 · answered Aug 16 '18 at 22:35

Look at the raw HTTP payload and make sure the links really appear there the way you believe they do.

The hostname is usually carried in an HTTP header, the protocol is very often not included in the page content, and so on. Install Fiddler and observe the raw data.

Netscaler RegEx works as intended.

Further: make sure to decompress (inflate) any compressed content before trying to rewrite it. If you don't, the Netscaler will try to match your rewrite patterns against the compressed data / chunked content.
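
The compression point can be illustrated with a short sketch in Python (an assumption for demonstration only; it does not model how the Netscaler handles responses). The page body and the URL pattern below are hypothetical. The idea is that a text pattern has nothing to match in a gzip-compressed body, while the same pattern finds every link once the body is decompressed.

```python
import gzip
import re

# Hypothetical page body, repeated so that compression clearly applies.
page = b'<a href="https://www.google.com">external</a>\n' * 50
url_pattern = re.compile(rb"https?://[\w.-]+")

# Roughly what is on the wire when the server sends Content-Encoding: gzip.
compressed = gzip.compress(page)

# The compressed bytes no longer contain the literal URL text, so the
# search is expected to find nothing.
match_in_compressed = url_pattern.search(compressed)

# After decompression, every link is visible to the pattern again.
matches_in_plain = url_pattern.findall(gzip.decompress(compressed))

print(match_in_compressed)
print(len(matches_in_plain))
```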
