I am having an issue with a string manipulation method I have written. The purpose of this method is to seek out link tags within a long string, and reformat their hrefs.
To give some context, I am parsing a large number of HTML files that were on a CD and collating the results into XML files that are on a website in a separate project (I wrote this as part of a console app). The HTML files contain instructional text with links that are relative to the files on the CD, and I need to change the hrefs to be relative to the website the information is going on.
The following code appears to work just fine if there is only one link tag, but pass it two and the output is corrupted. Strangely, Visual Studio's Regex editor shows the linkTag regex below matching only the link tags, but when it comes to replacing the links with the correct hrefs, the method inserts link fragments at various points within the instructions string.
The reason for the additional regex, alphaDir, is that I will eventually expand this method to correct links with different starting hrefs. We are talking about parsing thousands of HTML files, but this format is the most common by far.
I am at a bit of a loss on this one as I am very much a regex beginner, and I wrote all of the regexes below myself, so any thoughts on any of them would be great too.
Typical Input string
Hold 1st <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back outward
& fingers forward, and put 2nd <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back forward
& fingers inward, with lower knuckle of its 4th finger on
lower knuckle of 1st thumb; then slide 2nd hand forwards one
hand's length.
The Method
static string instructions(string instructions)
{
    Regex Spaces = new Regex(@"\s+|\n|\r");
    Regex linkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
    Regex linkTagHtml = new Regex(@"<a(.*?)>|<\/a>");
    Regex hrefAttr = new Regex("href=\"(.)*?\"");
    Regex alphaDir = new Regex(@"/([a-z])?/");
    string signName = string.Empty;
    char alphaChar;
    string replacementLinkTag = string.Empty;
    string replacementHref = string.Empty;
    instructions = Spaces.Replace(instructions, " ");
    MatchCollection matches = linkTag.Matches(instructions);
    foreach (Match link in matches)
    {
        Match alphaDirMatch = alphaDir.Match(link.Value.ToString());
        if (alphaDirMatch.Success)
        {
            Match hrefAttrMatch = hrefAttr.Match(link.Value.ToString());
            if (hrefAttrMatch.Success)
            {
                signName = linkTagHtml.Replace(link.Value.ToString(), string.Empty).ToLower().Trim();
                signName = signName.Replace(" ", "_");
                alphaChar = signName[0];
                replacementHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() + "&sign=" + signName + "\"";
                replacementLinkTag = hrefAttr.Replace(link.Value.ToString(), replacementHref);
                instructions = instructions.Remove(link.Index, link.Length);
                instructions = instructions.Insert(link.Index, replacementLinkTag);
            }
        }
    }
    return instructions;
}
Current output string
Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward & finge<a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a>f="../f/fist_hand.html">FIST</a></strong> hand, back forward & fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.
Desired output string
Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward & fingers forward, and put 2nd <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back forward & fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.
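Looking at the current output again, the stray fragment seems consistent with stale match offsets: the MatchCollection is computed once against the original string, but every Remove/Insert makes the string longer, so the second link's Index points partway through "fingers" instead of at the second tag. A minimal sketch of a regex-only alternative is below, assuming the same linkTag and hrefAttr patterns as the method above. It lets Regex.Replace splice each replacement in via a MatchEvaluator, so no offsets are tracked by hand. The class and method names (LinkRewriteSketch, Rewrite) are made up for illustration, and the alphaDir check is left out for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original console app.
static class LinkRewriteSketch
{
    // Same patterns as linkTag and hrefAttr in the method above.
    static readonly Regex LinkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
    static readonly Regex HrefAttr = new Regex("href=\"(.)*?\"");

    public static string Rewrite(string instructions)
    {
        // The evaluator is called once per match and returns the text that
        // replaces it; Regex.Replace does the splicing on the original input.
        return LinkTag.Replace(instructions, link =>
        {
            // Leave anchors without an href untouched.
            if (!HrefAttr.IsMatch(link.Value))
                return link.Value;

            // Group 2 is the link text, e.g. "FIST" -> "fist".
            string signName = link.Groups[2].Value.ToLower().Trim().Replace(" ", "_");
            if (signName.Length == 0)
                return link.Value;

            string href = "href=\"/pages/displayc.aspx?c=dictionary&alpha="
                          + signName[0] + "&sign=" + signName + "\"";
            return HrefAttr.Replace(link.Value, href);
        });
    }
}
```

Calling LinkRewriteSketch.Rewrite on the whitespace-normalised input should rewrite both links in place. Iterating the original matches in reverse order would be another way to keep the indexes valid, since edits near the end of the string do not move earlier offsets.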
The solution - Thanks for the suggestion Oded!
I used the HtmlAgilityPack to load the instructions string as HTML, selected the link tags into an HtmlNodeCollection, and then looped over them, reading each href value and rewriting it.
The code ended up looking like this for those interested:
// Requires: using System.Text.RegularExpressions; using HtmlAgilityPack;
static string instructions(string instructions)
{
    // \s already covers \n and \r, so this collapses any whitespace run to one space.
    Regex Spaces = new Regex(@"\s+");
    // A single-letter (or empty) directory segment, e.g. the "/f/" in "../f/fist_hand.html".
    Regex alphaDir = new Regex(@"/([a-z])?/");
    string signName = string.Empty;
    string replacementHref = string.Empty;

    instructions = Spaces.Replace(instructions, " ");

    HtmlDocument instr = new HtmlDocument();
    instr.LoadHtml(instructions);

    // SelectNodes returns null when no <a> elements are found.
    HtmlNodeCollection links = instr.DocumentNode.SelectNodes("//a");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href) && alphaDir.IsMatch(href))
            {
                // Strip the leading directories and the .html extension
                // (the dot is escaped so it only matches a literal '.').
                signName = Regex.Replace(href, @".*?/[a-z]?/|\.html", string.Empty);
                signName = signName.Replace(" ", "_");

                // Skip hrefs that leave no file name, so signName[0] cannot throw.
                if (signName.Length == 0)
                    continue;

                char alphaChar = signName[0];
                replacementHref = "/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar + "&sign=" + signName;
                link.SetAttributeValue("href", replacementHref);
            }
        }
    }

    // InnerHtml is already a string, so no ToString() is needed.
    return instr.DocumentNode.InnerHtml;
}
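One thing worth checking in isolation is the step that turns an href into a sign name, since it now reads the file name rather than the link text. The sketch below is only an illustration: SignNameSketch and FromHref are made-up names, and the pattern is a variant of the one in the method above with the dot in ".html" escaped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying the href-to-sign-name step on its own.
static class SignNameSketch
{
    public static string FromHref(string href)
    {
        // Remove everything up to and including the single-letter directory,
        // then remove the ".html" extension.
        string name = Regex.Replace(href, @".*?/[a-z]?/|\.html", string.Empty);

        // Sign names use underscores in place of spaces.
        return name.Replace(" ", "_");
    }
}
```

For the sample input, SignNameSketch.FromHref("../f/fist_hand.html") gives "fist_hand", so this version produces sign=fist_hand, whereas the link-text approach in my first attempt produced sign=fist.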