I have a block of HTML that looks something like this:

<p><a href="docs/123.pdf">33</a></p>

There are hundreds of anchor links whose href I need to replace based on the anchor text. For example, I need to replace the link above with something like:

<a href="33.html">33</a>

I will need to take the value 33 and do a lookup on my database to find the new link to replace the href with.

I need to keep the rest of the original HTML intact, as above.

How can I do this? Help!

lordy1981
  • Updated so you can see the HTML :-) – lordy1981 Jun 07 '11 at 22:37
  • Do you have HTML, or valid XML? – ulrichb Jun 07 '11 at 22:40
  • Are you dynamically generating this HTML (webserver) or do you just want to generate this file once / periodically with a commandline or windows executable? Also, do you need to "replace" them in an existing doc, or can you regenerate the whole document? – Louis Somers Jun 07 '11 at 23:23

5 Answers

Although this doesn't answer your question, the HTML Agility Pack is a great tool for manipulating and working with HTML: http://html-agility-pack.net

It could at least make grabbing the values you need and doing the replaces a little easier.

For pointers on getting started with it, see the related question: How to use HTML Agility pack
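As a rough illustration of what that could look like, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. `AgilityHrefRewriter` and `LookupHref` are hypothetical names, and `LookupHref` merely stands in for the database lookup described in the question.

```csharp
using System;
using HtmlAgilityPack;

static class AgilityHrefRewriter
{
    // Hypothetical placeholder for the real database lookup.
    static string LookupHref(string anchorText)
    {
        return anchorText + ".html";
    }

    public static string Rewrite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                // Use the anchor text as the lookup key for the new href.
                a.SetAttributeValue("href", LookupHref(a.InnerText.Trim()));
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `AgilityHrefRewriter.Rewrite(html)` on the question's snippet should leave the surrounding `<p>` markup in place and change only the `href` value.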

Joel Beckham
  • I've used the agility pack with great success. The problem with regular expressions is that if the markup isn't well formed you could get misses or false hits. The HTML agility pack is exactly what the OP needs for this. – SRM Jun 07 '11 at 22:48

Consider using the following rough algorithm.

using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class Program
{
  static void Main ()
  {
    string html = "<p><a href=\"docs/123.pdf\">33</a></p>"; // read the whole html file into this string.
    StringBuilder newHtml = new StringBuilder (html);
    // Group 1 captures the href value to replace; group 2 captures the anchor text used for the lookup.
    Regex r = new Regex (@"<a href=""([^""]+)"">([^<]+)");
    foreach (var match in r.Matches(html).Cast<Match>().OrderByDescending(m => m.Index))
    {
       string text = match.Groups[2].Value;
       string newHref = DBTranslate (text);
       newHtml.Remove (match.Groups[1].Index, match.Groups[1].Length);
       newHtml.Insert (match.Groups[1].Index, newHref);
    }

    Console.WriteLine (newHtml);
  }

  // Placeholder: replace with the real database lookup.
  static string DBTranslate(string s)
  {
    return "junk_" + s;
  }
}

(The OrderByDescending processes the matches from last to first, so the indexes of the earlier matches stay valid as you modify the StringBuilder.)

agent-j

Slurp your HTML into an XmlDocument (your markup is valid, isn't it?). Then use XPath to find all the <a> tags with an href attribute, apply the transform, and assign the new value to the href attribute. Finally, write the XmlDocument back out.

Easy!
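A minimal sketch of those steps, assuming the markup is well-formed XML with no default namespace and no HTML-only entities such as `&nbsp;` (otherwise `LoadXml` throws or the XPath finds nothing). `XmlHrefRewriter` and `LookupHref` are hypothetical names, and `LookupHref` stands in for the database lookup from the question.

```csharp
using System;
using System.Xml;

static class XmlHrefRewriter
{
    // Hypothetical placeholder for the real database lookup.
    static string LookupHref(string anchorText)
    {
        return anchorText + ".html";
    }

    public static string Rewrite(string xhtml)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep the original whitespace where possible
        doc.LoadXml(xhtml);            // throws XmlException if the markup is not well-formed

        // Select every <a> element that carries an href attribute.
        foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
        {
            // Use the anchor text as the lookup key for the new href.
            a.SetAttribute("href", LookupHref(a.InnerText.Trim()));
        }

        return doc.OuterXml;
    }
}
```

Passing the question's snippet to `XmlHrefRewriter.Rewrite` should return the same paragraph with the `href` set to `33.html`. If the input is real-world HTML rather than XHTML, this approach is likely to fail at the load step.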

Nicholas Carey

Use a regexp to find the values and replace them. A regexp like <p><a href="[^"]+">([^<]+)</a></p> will match each link and capture the anchor text.

Daniel

So, what you want to do is generate the replacement string based on the contents of the match. Consider using one of the Regex.Replace overloads that take a MatchEvaluator. Example:

static void Main()
{
  Regex r = new Regex(@"<a href=""[^""]+"">([^<]+)");

  string s0 = @"<p><a href=""docs/123.pdf"">33</a></p>";
  string s1 = r.Replace(s0, m => GetNewLink(m));

  Console.WriteLine(s1);
}

static string GetNewLink(Match m)
{
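  // Rebuild the opening tag, using the captured anchor text for both the href and the link text.
  return string.Format(@"<a href=""{0}.html"">{0}", m.Groups[1].Value);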
}

I've actually taken it a step further and used a lambda expression instead of explicitly creating a delegate method.

Alan Moore