
I need to replace the domain name in all links on a page, except links that point to images or PDF files. The input is a full HTML page received through a proxy service.

Example:
<a href="http://www.test.com/bla/bla">test</a><a href="/bla/bla"><img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf</a>
<a href="http://www.test.com/bla/bla/bla">test1</a>

Result:
<a href="http://www.newdomain.com/bla/bla">test</a><a href="/bla/bla"><img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf</a>
<a href="http://www.newdomain.com/bla/bla/bla">test1</a>
pbcomm

3 Answers


If you are using .NET, I strongly suggest using HTML Agility Pack. Parsing HTML directly with regex can be very error-prone. This question is also similar to the post below.

What regex should I use to remove links from HTML code in C#?

Community
Fadrian Sudaman

If the domain is http://www.example.com, the following should do the trick:

/http:\/\/www\.example\.com(?!\S*(?:pdf|jpg|png|gif)\s)\S*\s/

This uses a negative lookahead, placed right after the domain, so that the regex matches a URL only if it does not end in pdf, jpg, png or gif. It expects the URL to be followed by whitespace.
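
In real markup a URL usually ends at a quote rather than at whitespace, so here is a rough sketch of a variant that puts the lookahead directly after the domain and treats a quote, whitespace, or end of input as the end of the URL. The helper name `rewriteDomain` and the new domain are just assumptions for illustration, and this still will not skip an image whose URL has no file extension.

```javascript
// Sketch only: `rewriteDomain` is a hypothetical helper, not part of any library.
// The lookahead sits right after the domain and scans the rest of the URL,
// refusing the match when the URL ends in one of the listed extensions.
var pattern = /http:\/\/www\.example\.com(?![^\s"]*\.(?:pdf|jpg|png|gif)(?:["\s]|$))/gi;

function rewriteDomain(html, newDomain) {
  // Only the domain itself is matched, so the path is left untouched.
  return html.replace(pattern, newDomain);
}

var sample =
  '<a href="http://www.example.com/bla/bla">test</a>' +
  '<img src="http://www.example.com/a.png" />' +
  '<a href="http://www.example.com/test.pdf">pdf</a>';

var rewritten = rewriteDomain(sample, "http://www.newdomain.com");
```

With the sample above, only the first link should pick up the new domain; the image and the pdf link keep the old one.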

KJ Saxena

If none of your PDF URLs have query parameters (like a.pdf?asd=12), the following code will work. It replaces only absolute and root-relative URLs.

var links = document.getElementsByTagName("a");
var len = links.length;
var newDomain = "http://mydomain.com";
/**
 * Match absolute urls (starting with http) 
 * and root relative urls (starting with a `/`)
 * Does not match relative urls like "subfolder/anotherpage.html"
 * */
var regex = new RegExp("^(?:https?://[^/]+)?(/.*)$", "i");
//uncomment next line if you want to replace only absolute urls
//regex = new RegExp("^https?://[^/]+(/.*)$", "i");
for(var i = 0; i < len; i++)
{
  var link = links.item(i);
  var href = link.getAttribute("href");
  if(!href) //in case of named anchors
    continue;
  if(href.match(/\.pdf$/i)) //if pdf
    continue;
  href = href.replace(regex, newDomain + "$1");
  link.setAttribute("href", href);
}
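
If the pdf URLs might carry query parameters after all, one possible workaround is to parse each href with the standard `URL` class and test only its path. The sketch below is an assumption-laden variant, not a drop-in replacement: `rewriteHref` and its parameter names are made up for illustration, it swaps only the one host you name, and unlike the loop above it leaves root-relative URLs untouched.

```javascript
// Sketch only: `rewriteHref` and its parameters are hypothetical names.
// Relies on the standard URL class (modern browsers and Node.js).
function rewriteHref(href, oldHost, newHost) {
  var url;
  try {
    url = new URL(href); // throws for relative URLs, which carry no host to replace
  } catch (e) {
    return href;
  }
  if (url.hostname !== oldHost) {
    return href; // leave links to other domains alone
  }
  // pathname excludes the ?query and #fragment, so "a.pdf?asd=12" is still seen as a pdf
  if (/\.(?:pdf|jpe?g|png|gif)$/i.test(url.pathname)) {
    return href;
  }
  url.hostname = newHost;
  return url.href; // note: href is returned in normalized form
}
```

Inside the loop you would then call something like `link.setAttribute("href", rewriteHref(href, "www.test.com", "www.newdomain.com"))` instead of the regex replace.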
Amarghosh