I am having an issue with a string manipulation method I have written. The purpose of this method is to seek out link tags within a long string, and reformat their hrefs.

To give some context, I am parsing a large number of HTML files that were on a CD and collating the results into XML files for a website in a separate project (I wrote this as part of a console app). The HTML files contain instructional text with links that are relative to the files on the CD, and I need to change the hrefs to be relative to the website the information is going on.

The following code appears to work just fine if there is only one link tag, but pass it two, and the output is very messed up. Strangely, Visual Studio's Regex editor claims that the linkTag regex below is only matching the link tags, but when it comes round to replacing the links with the correct hrefs, it inserts link fragments at various points within the instructions string.

The reason for the additional alphaDir regex is that I will eventually expand this method to correct links with different starting hrefs. We are talking about parsing thousands of HTML files, but this format is the most common by far.

I am at a bit of a loss on this one, as I am very much a regex beginner and wrote all of the regexes below myself, so any thoughts on any of them would be great too.

Typical Input string

Hold 1st <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back outward
  &amp; fingers forward, and put 2nd <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back forward
  &amp; fingers inward, with lower knuckle of its 4th finger on
  lower knuckle of 1st thumb; then slide 2nd hand forwards one
  hand's length.

The Method

static string instructions(string instructions)
    {
        Regex Spaces = new Regex(@"\s+|\n|\r");
        Regex linkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
        Regex linkTagHtml = new Regex(@"<a(.*?)>|<\/a>");
        Regex hrefAttr = new Regex("href=\"(.)*?\"");
        Regex alphaDir = new Regex(@"/([a-z])?/");

        string signName = string.Empty;
        char alphaChar;
        string replacementLinkTag = string.Empty;
        string replacementHref = string.Empty;

        instructions = Spaces.Replace(instructions, " ");

        MatchCollection matches = linkTag.Matches(instructions);

        foreach (Match link in matches)
        {
            Match alphaDirMatch = alphaDir.Match(link.Value.ToString());
            if (alphaDirMatch.Success)
            {
                Match hrefAttrMatch = hrefAttr.Match(link.Value.ToString());
                if (hrefAttrMatch.Success)
                {
                    signName = linkTagHtml.Replace(link.Value.ToString(), string.Empty).ToLower().Trim();
                    signName = signName.Replace(" ", "_");
                    alphaChar = signName[0];

                    replacementHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() +"&sign=" + signName + "\"";
                    replacementLinkTag = hrefAttr.Replace(link.Value.ToString(), replacementHref);

                    instructions = instructions.Remove(link.Index, link.Length);
                    instructions = instructions.Insert(link.Index, replacementLinkTag);
                }
            }
        }            

        return instructions;
    }

Current output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; finge<a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a>f="../f/fist_hand.html">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.

Desired output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; fingers forward, and put 2nd <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.

The solution - Thanks for the suggestion Oded!

I used the HtmlAgilityPack to load the instructions string as HTML and select the link tags into an HtmlNodeCollection. I then loop over each link, read its href value, and rewrite it.

The code ended up looking like this for those interested:

using System.Text.RegularExpressions;
using HtmlAgilityPack;

static string instructions(string instructions)
    {
        // \s already covers \n and \r, so one pattern collapses every whitespace run.
        Regex Spaces = new Regex(@"\s+");
        // A slash, an optional lowercase letter, and another slash, e.g. "/f/".
        Regex alphaDir = new Regex(@"/([a-z])?/");

        instructions = Spaces.Replace(instructions, " ");

        HtmlDocument instr = new HtmlDocument();
        instr.LoadHtml(instructions);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = instr.DocumentNode.SelectNodes("//a");

        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", string.Empty);

                if (!string.IsNullOrWhiteSpace(href) && alphaDir.IsMatch(href))
                {
                    // Strip the leading "../f/" style path and the ".html" extension.
                    // The dot is escaped so that only a literal ".html" is removed.
                    string signName = Regex.Replace(href, @".*?/[a-z]?/|\.html", string.Empty);
                    signName = signName.Replace(" ", "_");

                    // Skip hrefs that leave no name, so signName[0] cannot throw.
                    if (signName.Length == 0)
                    {
                        continue;
                    }

                    string replacementHref = "/pages/displayc.aspx?c=dictionary&alpha=" + signName[0] + "&sign=" + signName;
                    link.SetAttributeValue("href", replacementHref);
                }
            }
        }

        // InnerHtml is already a string, so no ToString() call is needed.
        return instr.DocumentNode.InnerHtml;
    }

2 Answers

I recommend trying the HTML Agility Pack to parse and query your HTML documents.

Using RegEx can be rather brittle, and if the documents are not very uniform it may simply not work - see this SO answer.
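
If you do stay with regex for now, the scrambled output in the question is consistent with the loop reusing link.Index and link.Length from the original string after the first replacement has already changed the string's length, so the second Remove/Insert lands a few characters too early. A minimal sketch of one way around that is to let Regex.Replace do the splicing through a MatchEvaluator, so no offsets are handled by hand. The class and method names (LinkRewriter, RewriteLinks) are made up for illustration, the patterns are taken from the question, and it assumes the link text contains no nested tags:

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriter
{
    // Patterns copied from the question; they remain as brittle as before.
    static readonly Regex LinkTag = new Regex(@"<a(.*?)>(.*?)</a>");
    static readonly Regex HrefAttr = new Regex("href=\"(.*?)\"");
    static readonly Regex AlphaDir = new Regex(@"/([a-z])?/");

    public static string RewriteLinks(string html)
    {
        // Regex.Replace calls the lambda once per match and splices each result
        // into a new string itself, so stale match offsets are never reused.
        return LinkTag.Replace(html, m =>
        {
            // Leave links without a "/x/" style directory untouched.
            if (!AlphaDir.IsMatch(m.Value))
            {
                return m.Value;
            }

            // Group 2 is the link text, e.g. "FIST" (assumed to hold no nested tags).
            string signName = m.Groups[2].Value.Trim().ToLower().Replace(" ", "_");
            if (signName.Length == 0)
            {
                return m.Value;
            }

            string href = "href=\"/pages/displayc.aspx?c=dictionary&alpha="
                + signName[0] + "&sign=" + signName + "\"";

            // A lambda is used here too, so a '$' in href is never read as a substitution.
            return HrefAttr.Replace(m.Value, _ => href);
        });
    }
}
```

Calling LinkRewriter.RewriteLinks(instructions) after the whitespace clean-up would take the place of the foreach loop. This only removes the offset problem; it does nothing about the general fragility of matching HTML with regex.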

Oded
  • I have actually used the Html Agility Pack to get the string in the first place. Whilst iterating through each html file the console app adds a new item within the xml I am creating. Thanks for the suggestion though. I hadn't considered using it a second time, and will give that a go, as it will probably be a lot more robust, and I am a complete regex newbie. I will let you know how I get on. – Alex Woodhead Nov 17 '11 at 12:37
  • See above for my implementation of your suggestion. Any comments welcome. Thanks – Alex Woodhead Nov 17 '11 at 14:48
  • @AlexWoodhead - Looks OK, though instead of `href != string.Empty` I would normally go with `!string.IsNullOrWhiteSpace(href)` (.NET 4.0). – Oded Nov 17 '11 at 14:56

In addition to @Oded's answer, you could do this with a simple XSL transform. Regex, IMO, is not the way to go here.

FailedDev
  • You are assuming the HTML is also well formed XML. That can very well not be the case. – Oded Nov 17 '11 at 12:13
  • @Oded Yes, agreed. But in case it is, you have one dependency less. – FailedDev Nov 17 '11 at 12:20
  • What's this issue with static dependencies? It's not like this dependency would need to change? I'd rather have a dependency on a library that works than a solution without a dependency that is broken. – Oded Nov 17 '11 at 12:23
  • @Oded Well, maybe he is not allowed to use it, for example. Where I work, we had to wait for months to get Boost accepted. And after that only some libraries were accepted. – FailedDev Nov 17 '11 at 12:31
  • Fair point, though there is no such indication in the question. – Oded Nov 17 '11 at 12:32