
I am having problems trying to get my regular expression right. Basically, I have an HTML string which contains various links. If the href attribute points to the same domain, or a domain in a list of approved domains, nothing is changed. Anything else should be changed to a redirect page with the original href as a URL parameter.

For example, assume the following domain names are allowed:

domain1, domain2, domain3

and disallowed domains point to "/redirect.htm?url=..."

I would want the following string:

<p>this is a paragraph with 
    <a href="/index.htm">link 1</a> and 
    <a href="http://domain4/page.htm">link 2</a> and 
    <a href="http://www.domain1.com">link3</a> and 
    <a href="http://www.domain5.com/directory/page.htm">link 4</a>
</p>

to be changed to:

<p>this is a paragraph with 
    <a href="/index.htm">link 1</a> and 
    <a href="/redirect.htm?url=domain4/page.htm">link 2</a> and 
    <a href="http://www.domain1.com">link3</a> and 
    <a href="/redirect.htm?url=www.domain5.com/directory/page.htm">link 4</a>
</p>

I should also point out that I am using IdocScript, a Java-based custom language for our content management system. I don't need help with that, just the regular expression.

The best I have come up with so far (which clearly doesn't work) is:

<$ regex = "href=\"(^(/|domain1|domain2|domain3)" $>
<$ regexReplaceAll( originalString, regex, 'href="/redirect.htm?url=$1') $>

Can anyone help?

Typhoon101
    There are [problems with parsing HTML via regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Joachim Sauer Nov 12 '13 at 10:11

2 Answers

s/href="(?!(\/|[^"]*(domain1|domain2|domain3)))/href="\/redirect.htm?url=/

If we have a href, and it doesn't start with a slash and it doesn't contain domain1, domain2, or domain3, insert a redirect.

If needed, you can tighten up and look for specific subdomains as well:

s/href="(?!(\/|http:\/\/((www|mobile|mysubdomain)\.)?(domain1|domain2|domain3)))/href="\/redirect.htm?url=/

Take a href=" that's not followed by [a slash] nor by [an optional subdomain and one of the listed domains], replace it by that same href=" + /redirect.htm?url=.

I've escaped the slashes, but that may not be necessary in your regex dialect of choice.
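
For anyone who wants to try the lookahead idea outside IdocScript, here is a minimal sketch in Python. It uses the plain negative-lookahead form `(?!...)` and `[^"]*` so the check stays inside the current attribute value. The names `ALLOWED`, `PATTERN` and `add_redirect` are made up for illustration, and the allow-list is just the one from the question.

```python
import re

# Hypothetical allow-list, taken from the question's example.
ALLOWED = ("domain1", "domain2", "domain3")

# Match href=" only when the value that follows neither starts with a
# slash nor mentions an allowed domain before the closing quote.
PATTERN = re.compile(
    r'href="(?!/|[^"]*(?:%s))' % "|".join(re.escape(d) for d in ALLOWED)
)


def add_redirect(html):
    """Prefix every non-approved href value with the redirect page."""
    return PATTERN.sub('href="/redirect.htm?url=', html)
```

Note that this keeps the `http://` prefix inside the `url` parameter, which differs slightly from the output asked for in the question, and it does not URL-encode the original href. Running it a second time changes nothing, because the rewritten hrefs start with a slash.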

SQB

This one should suit your needs:

href="https?://((?:[^"](?<!\b(?:domain1|domain2|domain3)\b))+)"


Replace by:

href="/redirect.htm?url=$1"
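
A quick way to check this pattern against the links from the question is a small Python sketch like the one below. It assumes the allow-list from the question (domain1, domain2, domain3); `PATTERN` and `rewrite` are hypothetical names. Python writes the group reference as `\g<1>`, where a Java-style replacement string uses `$1`.

```python
import re

# After each consumed character, the lookbehind rejects the match if the
# text just read ends with one of the approved domain names.
PATTERN = re.compile(
    r'href="https?://((?:[^"](?<!\b(?:domain1|domain2|domain3)\b))+)"'
)


def rewrite(html):
    """Send non-approved absolute links through the redirect page."""
    # \g<1> is the captured URL without its http:// or https:// scheme.
    return PATTERN.sub(r'href="/redirect.htm?url=\g<1>"', html)
```

Because the scheme sits outside the capture group, it is dropped from the `url` parameter, which matches the output requested in the question. Relative links such as `/index.htm` never match, since the pattern requires `http://` or `https://`.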

sp00m