I have a regular expression that looks through a string and matches all of the relative anchor links, like: <a href="/leaderboard">Leaderboard</a>

It will not match where the href starts with HTTP or HTTPS.

Expression is:

<a.*?href="([^http]|[^https]).*?"[^<]

That part is good for now.

However, I cannot figure out how, once I have a match, to replace just the href attribute name with routerLink.

This:

<a href="/leaderboard">Leaderboard</a>

Becomes:

<a routerLink="/leaderboard">Leaderboard</a>

Note href is now routerLink.

There are 20+ matches, so I can't simply do a hard-coded replace for the Leaderboard link; I need to keep each relative path the same. The only thing in the matched string that changes is the attribute name, href to routerLink. The value of the attribute stays as is.

This is the part that is giving me trouble.

Any ideas here?

Thanks

Jordan McDonald

2 Answers

Even though the other answer is already accepted, I want to post an alternative that doesn't use regex:

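// Requires the HtmlAgilityPack NuGet package (plus using System; and using System.Linq;)
var doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionOutputOriginalCase = true; // write routerLink in its original case instead of lower-casing it
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so fall back to an empty sequence
var anchors = doc.DocumentNode.SelectNodes("//a[@href]")
              ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();
foreach (var a in anchors)
{
    var href = a.Attributes["href"];
    // Leave absolute http/https links untouched, as the question requires
    if (href.Value.StartsWith("http", StringComparison.OrdinalIgnoreCase)) continue;
    href.Remove();
    a.Attributes.Add("routerLink", href.Value);
}

var newHtml = doc.DocumentNode.OuterHtml;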

A LINQ query can be used in place of the XPath expression:

foreach (var a in doc.DocumentNode.Descendants("a")
                     .Where(n => n.Attributes["href"] != null))
{ /* same loop body as above */ }
L.B

Your pattern doesn't do what you think: an anchor such as <a href="halfway"> on its own is not matched, even though it is a relative link. The pattern you have, <a.*?href="([^http]|[^https]).*?"[^<], breaks down as:

  1. Find the characters <a literally.
  2. Skip as few characters as possible (.*?) until the next part can match.
  3. Match the characters href=" literally.
  4. Match one character that is not h, t or p ([^http]), or one character that is not h, t, p or s ([^https]). A character list in square brackets [ ] always matches exactly one character, so this only tests the first character of the href value; it does not test for the words http or https.
  5. Skip as few characters as possible (.*?) until the next part can match.
  6. Match the character " literally.
  7. Match a character that is not <.
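
To make the breakdown concrete, here is a small sketch that runs the pattern from the question against a few single-anchor strings. The helper name MatchesOriginal and the sample inputs are made up for illustration; only the pattern itself comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged ("" is an escaped quote in a verbatim string).
var original = new Regex(@"<a.*?href=""([^http]|[^https]).*?""[^<]");

// Hypothetical helper: does the question's pattern match this snippet at all?
bool MatchesOriginal(string html) => original.IsMatch(html);

// Starts with '/', which is not in either character class, so it matches.
Console.WriteLine(MatchesOriginal(@"<a href=""/leaderboard"">Leaderboard</a>")); // True

// Relative links whose first character happens to be h or t are rejected.
Console.WriteLine(MatchesOriginal(@"<a href=""halfway"">Halfway</a>"));         // False
Console.WriteLine(MatchesOriginal(@"<a href=""team"">Team</a>"));               // False
```

The last two are relative links that the pattern wrongly skips, purely because of their first letter.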

While this online site won't handle all of the .NET regular expression syntax, it shows the issues and explains how some of the matching operations work: https://regex101.com/r/raoCcA/1

This should work:

var pattern = @"href=(?=""(?!http|https))";

var ans = Regex.Replace(src, pattern, "routerLink=");
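
As a quick check, here is a minimal sketch of that replacement on a string holding one relative and one absolute link. The wrapper name ToRouterLinks and the sample markup (including example.com) are assumptions for the demo, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the replacement shown above.
// The lookahead checks the opening quote and the start of the value without
// consuming them, so only the text "href=" itself is replaced.
string ToRouterLinks(string src) =>
    Regex.Replace(src, @"href=(?=""(?!http|https))", "routerLink=");

var sample = @"<a href=""/leaderboard"">Leaderboard</a> <a href=""https://example.com"">External</a>";

Console.WriteLine(ToRouterLinks(sample));
// <a routerLink="/leaderboard">Leaderboard</a> <a href="https://example.com">External</a>
```

Note that the negative lookahead rejects any value that merely starts with the letters http, so a relative path such as httpd-notes would also be left alone.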

You can try to restrict the match to href attributes inside <a> tags if necessary, but it starts to get too complicated for regular expressions:

// \b after <a keeps tags such as <area> or <abbr> from being treated as anchors
var pattern = @"(?<=<a\b([^<>]|<!--|-->)+)href=(?=""(?!http|https))";
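
Here is a rough sketch of how a restricted pattern behaves when the markup also contains a non-anchor tag with an href. It uses a variant of the lookbehind with a \b word boundary after <a so that tags like <area> are not mistaken for anchors; the helper name AnchorsToRouterLinks and the sample markup are invented for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

// Variable-length lookbehind (supported by .NET): "href=" must be preceded by
// "<a", a word boundary, and then only characters that stay inside the same tag.
var anchorOnly = new Regex(@"(?<=<a\b([^<>]|<!--|-->)+)href=(?=""(?!http|https))");

// Hypothetical helper applying the restricted replacement.
string AnchorsToRouterLinks(string src) => anchorOnly.Replace(src, "routerLink=");

var page = @"<link href=""/site.css""><a class=""nav"" href=""/leaderboard"">Leaderboard</a>";

Console.WriteLine(AnchorsToRouterLinks(page));
// <link href="/site.css"><a class="nav" routerLink="/leaderboard">Leaderboard</a>
```

The <link> tag keeps its href because the lookbehind cannot find an opening <a within the same tag.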
NetMage