How to determine which HTML is "code" and which is "display/content"?

Question

I want to use C# to parse HTML data.

Imagine tagging every character of the HTML with a flag: true = "html/code", false = "display/content". With those flags you would know exactly which parts of the HTML are "code".

Let's use the following HTML example:

<a id="a1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

I want to use C# String.Replace to replace every instance of "a1" with "new1" and every instance of "attr1" with "new2". However, only the HTML "code" should be affected; the "content" must NOT change. The desired result is:

<a id="new1" class="c1" new2="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

Note: the desired result has 2 other instances of "a1" that were not renamed. Note: the desired result has 2 other instances of "attr1" that were not renamed.

I can't find any existing library or software that would help in this effort.

EDIT1: HtmlAgilityPack might be an option. However, I'm still no closer to understanding how I could use it to differentiate between code and not-code?

EDIT2: Please keep in mind that this question is a heavily simplified version of my real problem. Renaming things with and without quotes won't be the answer. I specifically need to figure out how to differentiate between code and not-code.

EDIT3: I have included "attr1" as a second String.Replace. I need to replace both attribute names AND attribute values, and I need to be able to distinguish between code and not-code.

Any suggestions?

Yeah, use HtmlAgilityPack. It was designed for parsing HTML, and it's even good at parsing malformed HTML. — Ryan Mann, Dec 16 '15 at 03:18
I was thinking HtmlAgilityPack could be an answer, I have used it before. However, I'm still no closer to understanding how I could use it to differentiate between code and not-code? — SED, Dec 16 '15 at 03:24
@AEonAX unfortunately that won't do the trick. The whole point of this is I must be able to figure out how to differentiate between code and not-code. — SED, Dec 16 '15 at 04:14
Isn't this just string replacement of **attributes** vs. **innerText** of HTML elements? If that's the case, then the HtmlAgilityPack + replacement of attribute data only would do the trick. — Brendan Green, Dec 16 '15 at 04:17
HtmlAgilityPack + replacement of attribute data FTW (as Brendan said). DO NOT even think about [Regex](http://stackoverflow.com/a/1732454/2186591) — AEonAX, Dec 16 '15 at 04:41
I agree that Regex is NOT an option. When you solve 1 problem with Regex, you now have 2 problems. Per EDIT3, how would you resolve this? — SED, Dec 16 '15 at 04:50
So you remove the existing attribute and add a new one in its place with the new name and value (where the value may also be replaced). This appears to be a constantly evolving question - I think there's been enough input to solve this problem - have you made any progress? — Brendan Green, Dec 16 '15 at 21:06

score 2 · Accepted Answer · answered Dec 16 '15 at 21:19

Following the comments made on this post, I came up with the following:

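void Main()
{
    var html = "<a id=\"attr1\" class=\"c1\" attr1=\"x\" attr2=\"y\">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>";

    var res = Replace(html, "attr1", "attrA");
}

// Renames attribute names and attribute values that exactly match oldval.
// Text content is never touched, because only attributes are visited.
public string Replace(string html, string oldval, string newval)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Descendants() visits nested elements too; ChildNodes would only
    // cover the top-level nodes of the document.
    foreach (var n in doc.DocumentNode.Descendants())
    {
        foreach (var a in n.Attributes)
        {
            // Attribute value, e.g. id="attr1"
            if (a.Value.Equals(oldval))
            {
                a.Value = newval;
            }

            // Attribute name, e.g. attr1="x"
            if (a.Name.Equals(oldval))
            {
                a.Name = newval;
            }
        }
    }

    return doc.DocumentNode.OuterHtml;
}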

Given the input:

<a id="attr1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

The output is:

<a id="attrA" class="c1" attra="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

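This should meet the current requirements. Note that the renamed attribute appears as attra rather than attrA, because HtmlAgilityPack lower-cases attribute names on output by default.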
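To make the "code" versus "content" distinction from the question explicit, here is a minimal sketch that labels each piece of a parsed document. It assumes the HtmlAgilityPack NuGet package is referenced; the class name MarkupVsContent and the method Classify are hypothetical names chosen for illustration and are not part of the library. Element names, attribute names, and attribute values are treated as "code", and text nodes as "content". Other node types, such as comments, are simply skipped here.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MarkupVsContent
{
    // Hypothetical helper: returns (Kind, Text) pairs, where Kind is
    // "code" for element names, attribute names and attribute values,
    // and "content" for text nodes.
    public static List<(string Kind, string Text)> Classify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<(string Kind, string Text)>();

        // Descendants() walks every node in the tree, including nested ones.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                result.Add(("code", node.Name));
                foreach (HtmlAttribute attr in node.Attributes)
                {
                    result.Add(("code", attr.Name));
                    result.Add(("code", attr.Value));
                }
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                result.Add(("content", node.InnerText));
            }
            // Comments and other node types are ignored in this sketch.
        }

        return result;
    }

    public static void Main()
    {
        var html = "<a id=\"a1\" class=\"c1\" attr1=\"x\" attr2=\"y\">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>";

        foreach (var (kind, text) in Classify(html))
        {
            Console.WriteLine($"{kind}: \"{text}\"");
        }
    }
}
```

With this split, the same string (for example "a1") can show up once as "code" (the id value) and again as "content" (the link text), which is why a plain String.Replace over the raw HTML cannot tell the two apart, while a replacement restricted to the "code" entries can.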
